[tx] General implementation of trainable Hyper Connections by tanmaysachan · Pull Request #1008 · NovaSky-AI/SkyRL

tanmaysachan · 2026-02-02T17:26:37Z

Addresses #952

This PR is a general implementation of Hyper connections.

This is supposed to be an extension like Lora, where the default case mimics a standard residual connection with identity mappings.

Default case - Trainable is false. Expansion rate is 1.

[edit] we now bypass this case entirely for a regular residual network.

H_res is a single value matrix [1]
H_pre and H_post are vectors of [1, 1, 1, ...] that result in no-op matmuls

For expansion rate > 1

H_res is initialized as identity of size nxn (n is the expansion rate)
H_pre is [1/n, 1/n, ...]
H_post is [1, 1, 1, ...]

These matrices preserve identity mapping. So expansion rate > 1 but untrainable still results in the the same outputs.

Todos

simplify rms integration - added elementwise_affine as a flag
Benchmark/ensure no regression for expansion_rate = 1 - minimal difference in step time when expansion rate is 1 and untrainable.

Future work

Fine tune on custom data with mHC + LoRA to see perf gains

gemini-code-assist

Code Review

This pull request introduces a general implementation of Hyper Connections as an extension to the transformer layers. The changes are mainly in tx/layers/connectors.py where the Connector module is defined, and in tx/models/deepseekv3.py to integrate it into the decoder layers.

My review found a couple of issues:

An unused trainable parameter in the Connector class which should be removed for clarity.
A bug in DeepseekV3Model when handling intermediate hidden states for expansion_rate > 1, where squeeze() is used incorrectly.

Overall, the implementation of the Hyper Connections logic seems to follow the intended pattern of pre/post processing around existing attention and MLP blocks. The changes are well-contained. Addressing the mentioned points will improve the robustness and clarity of the implementation.

gemini-code-assist · 2026-02-02T17:28:32Z

skyrl-tx/tx/models/deepseekv3.py

        for layer_idx, layer in enumerate(self.layers):
            if output_hidden_states:
-                all_hidden_states.append(hidden_states)
+                all_hidden_states.append(hidden_states.squeeze())


hidden_states.squeeze() is used here to process intermediate hidden states. This will only work correctly if expansion_rate is 1. For expansion_rate > 1, squeeze() will have no effect because the expansion dimension has size n > 1. This will result in appending a tensor with an incorrect shape (..., n, C) to all_hidden_states, which is inconsistent with other states and likely to cause issues downstream.

A more robust approach is to aggregate across the expansion dimension, for example by taking the mean.

Suggested change

all_hidden_states.append(hidden_states.squeeze())

all_hidden_states.append(hidden_states.mean(axis=-2))

skyrl-tx/tx/layers/connectors.py

tanmaysachan · 2026-02-02T19:08:15Z

skyrl-tx/tx/layers/layernorm.py

        self.eps = eps
        self.weight = Param(
-            size, dtype=dtype, kernel_init=nnx.with_partitioning(nnx.initializers.normal(), jax.P(None)), rngs=rngs
+            size, dtype=dtype, kernel_init=nnx.with_partitioning(nnx.initializers.ones_init(), jax.P(None)), rngs=rngs


Temporary, testing

https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.normalization.RMSNorm.html

Torch also initalizes to one by default

Due to adapter indexing, ended up re-implementing norm in the connector layer itself - this change can be removed. But considering torch as the baseline, ones_init fits better still

… rms in mhc

pcmoritz · 2026-02-05T00:54:29Z

This looks very elegant, thanks a lot for putting it together! Have you tried to do any end-to-end runs yet / studied the performance, both in terms of learning dynamics / accuracy, as well as how much slowdown it incurs :)

tanmaysachan · 2026-02-05T01:08:24Z

Just waiting for the weekend to give it a spin 😅

I'll give Qwen0.6B a shot on an A/H100

pcmoritz · 2026-02-05T02:00:15Z

Sounds great! I'm putting together the 0.3.0 release at the moment, so it will probably need to wait then, but 0.3.1 should come relatively soon thereafter, so it is not a problem. I'll put a callout in the release blog anyways, if somebody wants to try it out, they can just apply the diff themselves given how simple this is :)

tanmaysachan · 2026-02-11T11:57:13Z

Did some analysis on the step times for each on Qwen 0.6B (on a 5060Ti)

Expansion rate as 1 does cause a hit to the average step time (about 0.3s slower, baseline has a step time of 2.1s vs 2.4s). An easy fix would be to just short circuit the entire thing for expansion rate = 1.

For expansion rate = 4, the step time was around 3.17s, so about 46% slower.

tanmaysachan · 2026-02-11T18:30:47Z

Loss plot for Qwen0.6B with an expansion rate = 4 max_lora_adapters=2, max_lora_rank=1.

(some more analysis todo)

devin-ai-integration

Devin Review found 1 new potential issue.

View 14 additional findings in Devin Review.

devin-ai-integration · 2026-02-12T15:47:59Z

skyrl-tx/tx/tinker/backends/jax.py

            """Compute full gradients, apply optimizer update, and reset accumulated grads."""
            optimizer.update(lora_params, accumulated_grads.get_mean(adapter_index))
-            return accumulated_grads.reset_adapter(adapter_index)
+            if global_optimizer is not None and self.has_global_trainables:
+                global_optimizer.update(global_params, global_accumulated_grads.get_mean())
+                global_accumulated_grads = global_accumulated_grads.reset()
+            return accumulated_grads.reset_adapter(adapter_index), global_accumulated_grads


🔴 Global optimizer updated with zero gradients on second adapter's optim_step

When multiple LoRA adapters are active, the shared global optimizer receives spurious zero-gradient updates, corrupting its Adam state.

Root Cause

In compute_grads_and_update (jax.py:531-536), the global optimizer is updated and the global accumulated gradients are reset unconditionally on every call:

if global_optimizer is not None and self.has_global_trainables: global_optimizer.update(global_params, global_accumulated_grads.get_mean()) global_accumulated_grads = global_accumulated_grads.reset()

Since optim_step is called once per adapter (jax.py:773-809), with two adapters the sequence is:

optim_step(adapter_1) → updates global optimizer with real mean gradients, resets global_accumulated_grads to zero

optim_step(adapter_2) → updates global optimizer again with get_mean() of the now-zeroed gradients (all zeros), resets again

The second zero-gradient update corrupts Adam's internal state:

First moments decay: m_t = β₁ · m_{t-1} + (1-β₁) · 0 — momentum decays toward zero

Second moments decay: v_t = β₂ · v_{t-1} + (1-β₂) · 0 — variance estimate shrinks

Step counter increments, affecting bias correction

Impact: Global trainable parameters (connectors) receive incorrect optimizer updates that degrade training quality, with severity proportional to the number of adapters.

Prompt for agents

The global optimizer should only be updated once per training iteration, not once per adapter. Currently in compute_grads_and_update (jax.py:531-536), the global optimizer is updated and global accumulated gradients are reset on every call, but optim_step is called once per adapter. Fix this by either: (1) tracking whether global grads have already been applied in this iteration and skipping if already done (e.g., check global_accumulated_grads.count > 0 before updating), or (2) decoupling the global optimizer step from the per-adapter optim_step so it runs exactly once per training iteration. Option (1) is simpler: guard the global optimizer update with a check like `if global_accumulated_grads.count > 0` before calling global_optimizer.update.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration

Devin Review found 2 new potential issues.

View 17 additional findings in Devin Review.

skyrl-tx/tx/layers/lora.py

devin-ai-integration · 2026-02-12T19:55:30Z

skyrl-tx/tx/layers/connectors.py

+    def _get_adapter_indices(self, batch_size: int, adapter_indices: jax.Array | None) -> jax.Array:
+        if adapter_indices is None:
+            return jnp.zeros((batch_size,), dtype=jnp.int32)
+        return adapter_indices.astype(jnp.int32)


🟡 LoRAConnector broken when max_lora_adapters=0 — indexing into 0-sized parameter arrays returns wrong values

When a model is created with max_lora_adapters=0 (e.g., tx/run/train.py:80), the LoRAConnector creates all parameter arrays with a first dimension of 0. When pre() or post() is called, _get_adapter_indices returns jnp.zeros((B,), dtype=jnp.int32), and _get_params indexes into these 0-sized arrays, producing zero-filled results instead of the identity-preserving values.

Detailed Explanation

Unlike LoRAMixin.apply_lora which short-circuits when max_lora_adapters == 0 (lora.py:85), LoRAConnector has no such guard. When max_lora_adapters=0:

self.b_pre has shape (0, n), self.b_res has shape (0, n, n), etc.

_get_adapter_indices(B, None) returns jnp.zeros((B,)) at connectors.py:66

_get_params indexes into 0-sized arrays at connectors.py:71-80 — JAX clips out-of-bounds indices and returns zeros

In pre(): b_pre=0 → H_pre = sigmoid(0) = 0.5 instead of 1/n

In post(): b_res=0 → M = sinkhorn(zeros) produces a uniform 1/n matrix instead of identity

For the default expansion_rate=1, the impact on pre is masked by RMSNorm (the 0.5 scale cancels during normalization), and post still produces the correct residual + output. So the default case is approximately correct. However, for expansion_rate > 1 with max_lora_adapters=0, the connector would produce completely wrong outputs (uniform mixing instead of identity passthrough).

This path is exercised in production via tx/run/train.py:80 which uses max_lora_adapters=0.

Prompt for agents

Add a guard in LoRAConnector to handle the max_lora_adapters=0 case. The simplest approach is to add a check at the start of pre() and post() methods that bypasses the connector logic when max_lora_adapters is 0, falling back to identity behavior: pre() should return x.sum(axis=-2) / n (or equivalently the mean), and post() should return residual + output[..., None, :] (broadcasting output into the expansion dimension). Alternatively, ensure the constructor always creates at least 1 adapter slot (with identity initialization) even when max_lora_adapters=0, similar to how the default adapter_index=0 is used when adapter_indices is None.

Was this helpful? React with 👍 or 👎 to provide feedback.

tanmaysachan · 2026-02-12T22:07:36Z

1.7B Qwen, without expansion rate and with rate = 4 (roughly identical loss plots)
mHC times with training are about 93% higher than regular per step.

tanmaysachan · 2026-02-13T06:13:06Z

The loss differences are in a similar scale as to what is observed in the mHC paper.

Ground truth mHC analysis -

This is in preparation for merging #1008 and to make it easier to introduce metrics.  --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1191" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a>

pcmoritz · 2026-02-20T21:44:56Z

Thanks a lot for all the updates, I'll do the rest (already merged a PR that cleans things up a little #1191) :)

pcmoritz · 2026-02-20T23:05:02Z

skyrl-tx/tx/layers/connectors.py

+        C = hidden_dim
+
+        # Phi matrices are zero-initialized so that alpha * x @ 0 + bias = bias at init.
+        self.input_norm_weight = nnx.Param(jnp.ones((max_lora_adapters, n * C), dtype=dtype))


I'm curious, why did you make the RMSNorm per adapter and trainable? That seems wrong, we should probably just use the RMSNorm from the base model :) [I don't think any of the LoRA codes out there make the RMSNorm trainable]

Actually I think I misunderstood the code and you are doing the right thing :)

Sorry for going back on forth on this, but I think the actually correct implementation would be to pass the input norm parameters from the model to the constructor of LoRAConnector and use it for the normalization below and keep it non-trainable. It will be slightly redundant to apply the norm twice, but I think for code clarity that's fine for now (there is more optimizations to do anyways). Let me know about your thoughts, I'll give that a shot :)

Thanks for the change!
My layernorm change was around making the entire block including the norms trainable - but yeah if its something like lora that shouldn't be the case

pcmoritz · 2026-02-23T02:07:40Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a general implementation of trainable Hyper Connections (mHC) as an extension to LoRA. The changes are extensive, touching model configurations, layer implementations, backend logic, and utility functions for checkpointing. The core logic resides in the new LoRAConnector module, and its integration into models like DeepseekV3 and Qwen3 appears correct, properly handling the new stream dimension. The utility functions for state management and checkpointing have also been updated to support these new connector parameters. The tests provide a good foundation, but I have identified a few areas for improvement to enhance their robustness and address minor issues in the implementation.

gemini-code-assist · 2026-02-23T02:09:39Z

skyrl-tx/tests/layers/test_connectors.py

+        logits_e1 = np.asarray(model_e1.compute_logits(outputs_e1.last_hidden_state))
+        logits_e4 = np.asarray(model_e4.compute_logits(outputs_e4.last_hidden_state))
+
+    np.testing.assert_allclose(logits_e1, logits_e4, rtol=5e-2, atol=5e-2)


The tolerance for this assert_allclose is set to 5e-2 (5%), which is quite high for a test that aims to verify that the initial connector behavior keeps logits unchanged. This high tolerance might mask subtle deviations from the expected identity mapping. Consider lowering the tolerance (e.g., to 1e-5 or 1e-6) to ensure the identity initialization is working as precisely as intended.

gemini-code-assist · 2026-02-23T02:09:39Z

skyrl-tx/tests/layers/test_connectors.py

+class _TinyConnector(nnx.Module):
+    def __init__(self, max_adapters: int):
+        self.alpha_pre = nnx.Param(jnp.zeros((max_adapters, 4), dtype=jnp.float32))
+        self.phi_pre = nnx.Param(jnp.zeros((max_adapters, 4, 2), dtype=jnp.float32))


The _TinyConnector mock is incomplete and only contains a subset of the parameters from the actual LoRAConnector. This means that tests relying on this mock (like test_connector_adapter_slice_save_load_safetensors and test_connector_extract_insert_adapter_state_roundtrip) are not comprehensively verifying the serialization and state management logic for all connector parameters (e.g., b_pre, b_post, b_res, phi_post, phi_res, etc.).

To improve test coverage and ensure correctness, please expand _TinyConnector to include all parameters present in LoRAConnector and update the corresponding tests to check these additional parameters.

pcmoritz · 2026-02-23T03:14:54Z

skyrl-tx/tx/layers/connectors.py

+            return x[..., 0, :], x.reshape(B, T, n * C)
+
+        adapter_indices = self._get_adapter_indices(B, adapter_indices)
+        # Apply input_norm independently to each of the n streams.


The paper is not super clear on whether this is the right way to do it -- below equation (5) it says RMSNorm is applied to the last dimension C. In equation (7) it looks more like the RMSNorm is applied on the full n * C dimension. I chose the interpretation according to equation (5) since it is slightly more elegant and doesn't require changing the definition of the RMSNorm. Once a DeepSeek model is released that supports mHC, we can revisit this.

I think HC applies the RMSNorm to the last dimension C and mHC applies it to n * C, this becomes pretty apparent from equations (15) and (16). We should switch as soon as we have a model that natively supports mHC.

Did you observe anything off with trainable norms for n*C? Just curious as to why that wont fit

The actual difference in performance is pretty small, and once we have a model that is trained with mHC, it will have an RMSNorm weight of size n * C, so it will be very easy to adapt the current code to it (and there is no need for a trainable norm in the LoRA setting), so I feel like that's the better solution for now :)

Sounds good. For already pretrained models, the original HC paper is probably a better fit than mHC as it does have experimentations with HC as an augmentation (unlike mHC).

So RMS over C is probably better for the general case.

See #1008  --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1217" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a>

tanmaysachan added 3 commits February 2, 2026 22:53

Initial design

57d1881

Merge branch 'main' into tanmay/mhc

98c0994

Add comment

91d5e74

gemini-code-assist bot reviewed Feb 2, 2026

View reviewed changes

tanmaysachan added 2 commits February 2, 2026 23:58

Identity mapping for initial passthrough

24b82d7

Add trainable flag for freezing weights

874ab08

pcmoritz added the tx label Feb 2, 2026

Stray comment

975faa1

tanmaysachan commented Feb 2, 2026

View reviewed changes

tanmaysachan added 2 commits February 3, 2026 00:55

simplify

f685543

Add elementwise_affine flag to RMS to match pytorch impl. Replace raw…

e493ae5

… rms in mhc

tanmaysachan added 2 commits February 10, 2026 23:41

Merge branch 'main' into tanmay/mhc

b4ad7ad

Add to qwen, restore norms

066af09

This comment was marked as resolved.

Sign in to view

tanmaysachan added 3 commits February 12, 2026 09:52

remove rms changes

bcd4e41

make mhc trainable

744ce19

jax backend alternate path for global trainables

9f88cc5

devin-ai-integration bot reviewed Feb 12, 2026

View reviewed changes

tanmaysachan added 2 commits February 13, 2026 01:16

Make connectors part of Lora training flow

495bb38

stray change

c204a38

devin-ai-integration bot reviewed Feb 12, 2026

View reviewed changes

how even

7a2e921

This comment was marked as resolved.

Sign in to view

pcmoritz added 3 commits February 20, 2026 13:46

Merge branch 'main' into tanmay/mhc

5260f38

update

9353ecb

update

8086212

pcmoritz reviewed Feb 20, 2026

View reviewed changes

pcmoritz added 5 commits February 20, 2026 16:17

update

cb7a559

update

7752d22

cleanup

5da4994

update

07e5cce

simplify

f7a6756

This comment was marked as resolved.

Sign in to view

pcmoritz added 10 commits February 21, 2026 19:11

unify defaults

a0a3ce0

update

96089f3

update

4e60734

simplify

ac9b815

update

644b714

do not reorder

209dccc

update

72d4369

update

df3501f

canonicalize configuration

c546a8a

rename

74895d0

gemini-code-assist bot reviewed Feb 23, 2026

View reviewed changes

pcmoritz reviewed Feb 23, 2026

View reviewed changes

pcmoritz approved these changes Feb 23, 2026

View reviewed changes

pcmoritz merged commit 312fddb into NovaSky-AI:main Feb 23, 2026
1 of 3 checks passed

pcmoritz added a commit to pcmoritz/SkyRL that referenced this pull request Feb 25, 2026

[tx] Port NovaSky-AI#1008 to skyrl folder

6473634

pcmoritz mentioned this pull request Feb 25, 2026

[tx] Port https://github.com/NovaSky-AI/SkyRL/pull/1008 to skyrl folder #1217

Merged

	all_hidden_states.append(hidden_states.squeeze())
	all_hidden_states.append(hidden_states.mean(axis=-2))

Conversation

tanmaysachan commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Default case - Trainable is false. Expansion rate is 1.

For expansion rate > 1

Todos

Future work

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pcmoritz commented Feb 5, 2026

Uh oh!

tanmaysachan commented Feb 5, 2026

Uh oh!

pcmoritz commented Feb 5, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

tanmaysachan commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tanmaysachan commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

devin-ai-integration bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

tanmaysachan commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tanmaysachan commented Feb 13, 2026

Uh oh!

pcmoritz commented Feb 20, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

pcmoritz commented Feb 23, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

tanmaysachan commented Feb 2, 2026 •

edited

Loading

tanmaysachan commented Feb 11, 2026 •

edited

Loading

tanmaysachan commented Feb 11, 2026 •

edited

Loading

tanmaysachan commented Feb 12, 2026 •

edited

Loading