Variance-transfer initialization#
This section describes the variance-transfer initialization scheme from [Yuan2023], adapted to the GroMo codebase. It covers two independent mechanisms – neuron pairing for function preservation and rescaling strategies for variance stability – and their combined analysis.
Setup: ResNet BasicBlock structure#
A Conv2dGrowingBlock consists of two convolution layers with a residual
connection:
x ───────────────────────────────────────────────────── (+) ── out
|                                                       ^
└─> [PreAct] -> [Conv1: W1] -> [MidAct] -> [Conv2: W2] ─┘
with:
\(W_1 \in \mathbb{R}^{h \times C_{\text{in}} \times k \times k}\) (first conv, in_channels \(\to\) hidden_channels)
\(W_2 \in \mathbb{R}^{C_{\text{out}} \times h \times k \times k}\) (second conv, hidden_channels \(\to\) out_channels)
\(h\) is the current hidden channel count
\(C_{\text{in}}, C_{\text{out}}\) are fixed by the residual connection
Growth adds neurons to the hidden dimension \(h \to h'\):
\(W_1\) grows along its output dimension (more output channels)
\(W_2\) grows along its input dimension (more input channels)
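The effect of growth on the two weight shapes can be sketched in a few lines. The helper name `grown_shapes` and its arguments are hypothetical and not part of GroMo; the sketch only restates the shape bookkeeping described above.

```python
def grown_shapes(c_in, c_out, h_old, h_new, k):
    """Return the (W1, W2) weight shapes before and after growing h_old -> h_new.

    Hypothetical helper for illustration only (not a GroMo function).
    W1 grows along its output dimension (axis 0),
    W2 grows along its input dimension (axis 1).
    """
    if h_new < h_old:
        raise ValueError("growth cannot reduce the hidden channel count")
    before = ((h_old, c_in, k, k), (c_out, h_old, k, k))
    after = ((h_new, c_in, k, k), (c_out, h_new, k, k))
    return before, after


before, after = grown_shapes(c_in=16, c_out=16, h_old=32, h_new=40, k=3)
w1_shape, w2_shape = after
```

Only the hidden dimension changes; \(C_{\text{in}}\), \(C_{\text{out}}\) and \(k\) are untouched, which is why the residual addition stays valid after growth.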
Part 1: Neuron pairing – \((V,V)/(Z,-Z)\)#
This section describes how new neurons are added, independently of how existing weights are rescaled (Part 2).
Structure#
We add \(\Delta h\) hidden channel pairs so that \(h_{t+1} = h_t + 2\Delta h\). Each new neuron is duplicated into a pair whose net contribution cancels at initialization.
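New weight matrices:
\[
V \in \mathbb{R}^{\Delta h \times C_{\text{in}} \times k \times k}, \qquad
Z \in \mathbb{R}^{C_{\text{out}} \times \Delta h \times k \times k}
\]
Assembled layers after growth (where \(\alpha, \beta\) are rescaling factors from Part 2), with concatenation along the hidden channel dimension:
\[
W_1^{(t+1)} = \begin{bmatrix} \alpha\, W_1^{(t)} \\ V \\ V \end{bmatrix}
\in \mathbb{R}^{h_{t+1} \times C_{\text{in}} \times k \times k}, \qquad
W_2^{(t+1)} = \begin{bmatrix} \beta\, W_2^{(t)} & Z & -Z \end{bmatrix}
\in \mathbb{R}^{C_{\text{out}} \times h_{t+1} \times k \times k}
\]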
Function preservation#
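The new neurons produce activations \(\sigma(V * x)\), duplicated identically. The second layer reads them via \((Z, -Z)\):
\[
Z * \sigma(V * x) + (-Z) * \sigma(V * x) = 0
\]
Therefore the block output is preserved up to rescaling:
\[
W_2^{(t+1)} * \sigma\bigl(W_1^{(t+1)} * x\bigr)
= \beta\, W_2^{(t)} * \sigma\bigl(\alpha\, W_1^{(t)} * x\bigr)
= \alpha\,\beta\; W_2^{(t)} * \sigma\bigl(W_1^{(t)} * x\bigr)
\]
The last equality uses \(\sigma(\alpha u) = \alpha\,\sigma(u)\), which holds for ReLU and \(\alpha > 0\).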
Exact function preservation requires \(\alpha\,\beta = 1\).
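The cancellation can be checked numerically. The sketch below is a simplification, not GroMo code: it replaces the convolutions by dense matrices (equivalent to \(k = 1\) at a single spatial position), uses ReLU as the mid-activation, and the function name `grow_with_pairing` is hypothetical.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def grow_with_pairing(w1, w2, v, z, alpha=1.0, beta=1.0):
    """Assemble W1 <- [alpha*W1; V; V] and W2 <- [beta*W2, Z, -Z].

    Hypothetical helper; dense matrices stand in for convolution kernels.
    """
    w1_new = np.concatenate([alpha * w1, v, v], axis=0)  # grow output dim
    w2_new = np.concatenate([beta * w2, z, -z], axis=1)  # grow input dim
    return w1_new, w2_new


rng = np.random.default_rng(0)
c_in, c_out, h, delta_h = 4, 4, 6, 2
w1 = rng.normal(size=(h, c_in))
w2 = rng.normal(size=(c_out, h))
v = rng.normal(size=(delta_h, c_in))
z = rng.normal(size=(c_out, delta_h))
x = rng.normal(size=(c_in, 10))  # 10 input samples stored as columns

alpha, beta = 2.0, 0.5  # alpha * beta == 1, so the output should be unchanged
w1_new, w2_new = grow_with_pairing(w1, w2, v, z, alpha, beta)

y_old = w2 @ relu(w1 @ x)
y_new = w2_new @ relu(w1_new @ x)
preserved = np.allclose(y_old, y_new)
```

With any other product \(\alpha\,\beta\), `y_new` equals `alpha * beta * y_old` instead, which is the "up to rescaling" statement above.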
Part 2: Rescaling strategies#
Three strategies are supported for choosing the rescaling factors \(\alpha\) (Conv1) and \(\beta\) (Conv2). They are independent of the neuron-pairing mechanism.
Strategy A: "default_vt"#
Default strategy from [Yuan2023] (Section 3.1, Table 1). Rescaling factors depend only on the width ratio, not on actual weight statistics.
Conv1’s input (\(C_{\text{in}}\)) is not extended, so Conv1 is not rescaled:
\[
\alpha = 1, \qquad \beta = \sqrt{\frac{h_t}{h_{t+1}}}
\]
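Because Strategy A depends only on the two widths, it reduces to a one-line computation. A minimal sketch, with a hypothetical helper name that is not part of the GroMo API:

```python
import math


def default_vt_factors(h_old, h_new):
    """Return (alpha, beta) for the "default_vt" strategy.

    Hypothetical helper for illustration. alpha is 1 because Conv1's
    fan-in is unchanged; beta shrinks Conv2 by sqrt(h_old / h_new).
    """
    if not 0 < h_old <= h_new:
        raise ValueError("expected 0 < h_old <= h_new")
    return 1.0, math.sqrt(h_old / h_new)


alpha, beta = default_vt_factors(h_old=64, h_new=80)
```

Note that \(\alpha\,\beta = \beta < 1\) whenever the block actually grows, so this strategy does not give exact function preservation on its own.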
Strategy B: "vt_constraint_old_shape"#
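From the paper's Appendix (Theorem 1). Uses the actual weight statistics to enforce \(\operatorname{Var}[W] = 1/\text{fan\_in\_old}\) after rescaling:
\[
\alpha = \frac{1}{\sqrt{C_{\text{in}} k^2\, \operatorname{Var}[W_1]}}, \qquad
\beta = \frac{1}{\sqrt{h_t\, k^2\, \operatorname{Var}[W_2]}}
\]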
Strategy C: "vt_constraint_new_shape"#
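Like Strategy B but targets \(1/\text{fan\_in\_new}\) instead of \(1/\text{fan\_in\_old}\):
\[
\alpha = \frac{1}{\sqrt{C_{\text{in}} k^2\, \operatorname{Var}[W_1]}}, \qquad
\beta = \frac{1}{\sqrt{h_{t+1}\, k^2\, \operatorname{Var}[W_2]}}
\]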
Note: \(\alpha\) is the same as in Strategy B (Conv1’s fan-in does not change during growth). Only \(\beta\) differs: \(h_t\) (B) vs \(h_{t+1}\) (C) in the denominator.
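Strategies B and C differ only in the fan-in used for Conv2, so one function can cover both. The sketch below is an illustration under assumptions: the helper name is hypothetical, and it uses NumPy's population variance, whereas the estimator used in the codebase is not stated here.

```python
import numpy as np


def vt_constraint_factors(w1, w2, h_new=None):
    """Return (alpha, beta) from the empirical weight variances.

    Hypothetical helper. w1 has shape (h, c_in, k, k), w2 has shape
    (c_out, h, k, k). With h_new=None the old hidden width is used
    (Strategy B); otherwise the new width h_new is used (Strategy C).
    """
    h, c_in, k, _ = w1.shape
    fan_in_1 = c_in * k * k                            # unchanged by growth
    fan_in_2 = (h if h_new is None else h_new) * k * k
    alpha = 1.0 / np.sqrt(fan_in_1 * w1.var())
    beta = 1.0 / np.sqrt(fan_in_2 * w2.var())
    return float(alpha), float(beta)


rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.3, size=(16, 8, 3, 3))
w2 = rng.normal(scale=0.3, size=(8, 16, 3, 3))

alpha_b, beta_b = vt_constraint_factors(w1, w2)            # Strategy B
alpha_c, beta_c = vt_constraint_factors(w1, w2, h_new=20)  # Strategy C
```

After rescaling, `(alpha_b * w1).var()` equals `1 / (8 * 9)` and `(beta_b * w2).var()` equals `1 / (16 * 9)` up to rounding, while `(beta_c * w2).var()` equals `1 / (20 * 9)`.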
Summary table#
| Strategy | \(\alpha\) (Conv1) | \(\beta\) (Conv2) |
|---|---|---|
| A: Default VT | \(1\) | \(\sqrt{h_t / h_{t+1}}\) |
| B: VT old shape | \(1 / \sqrt{C_{\text{in}} k^2\, \operatorname{Var}[W_1]}\) | \(1 / \sqrt{h_t\, k^2\, \operatorname{Var}[W_2]}\) |
| C: VT new shape | \(1 / \sqrt{C_{\text{in}} k^2\, \operatorname{Var}[W_1]}\) | \(1 / \sqrt{h_{t+1}\, k^2\, \operatorname{Var}[W_2]}\) |
BatchNorm adjustment#
When a layer’s weights are scaled by factor \(c\), the BatchNorm running statistics must be adjusted accordingly:
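One way to see what the adjustment has to be: if a bias-free layer's weights are multiplied by \(c > 0\), its outputs are multiplied by \(c\), so their mean scales by \(c\) and their variance by \(c^2\). The sketch below illustrates that relation with NumPy. It assumes the BatchNorm directly follows the scaled layer and that the layer has no bias; the helper name is hypothetical and this is not the GroMo implementation.

```python
import numpy as np


def rescale_bn_stats(running_mean, running_var, c):
    """Adjust BatchNorm running statistics for a layer scaled by c.

    Hypothetical helper; assumes a bias-free layer feeding the BatchNorm.
    """
    return c * running_mean, (c ** 2) * running_var


rng = np.random.default_rng(0)
out = rng.normal(loc=1.5, scale=2.0, size=(1000, 8))  # (samples, channels)
mean, var = out.mean(axis=0), out.var(axis=0)

c = 0.5
new_mean, new_var = rescale_bn_stats(mean, var, c)
scaled_out = c * out  # what the layer produces after its weights are scaled

# Normalized activations are identical before and after the adjustment.
norm_before = (out - mean) / np.sqrt(var)
norm_after = (scaled_out - new_mean) / np.sqrt(new_var)
```

With the adjusted statistics the normalized output is unchanged, so the scaling does not leak into the layers that follow the BatchNorm.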
Part 3: Combined analysis#
This section analyses the resulting weight and activation variances after applying a rescaling strategy together with \((V,V)/(Z,-Z)\) neuron pairing.
Definitions#
Consider the forward pass through one BasicBlock after a growth step. Let \(x_{\text{pre}} = \sigma_{\text{pre}}(x)\) be the pre-activated input to Conv1.
Hidden activations \(u\) (output of Conv1, input to Conv2):
\(u'\): component from old weights: \(u' = \alpha\, W_1^{(t)} * x_{\text{pre}}\)
\(u''\): component from new weights: \(u'' = V * x_{\text{pre}}\) (duplicated as \((V, V)\))
After mid-activation: \(\sigma(u) = [\sigma(u'),\; \sigma(u''),\; \sigma(u'')]\)
Block output \(y\) (output of Conv2, before residual addition):
\(y' = \beta\, W_2^{(t)} * \sigma(u')\) (old pathway)
\(y'' = Z * \sigma(u'') + (-Z) * \sigma(u'') = 0\) (new pathway cancels at init)
At initialization \(y'' = 0\), so \(y = y' = \alpha\,\beta\;\text{Block}_t(x)\).
Activation variance#
Assuming inputs have unit variance and weights are independent:
Hidden activations:
Block output (old pathway):
New pathway at init: \(\operatorname{Var}[y'']_{\text{init}} = 0\) (by \((Z,-Z)\) cancellation).
New pathway after first gradient step (symmetry broken):
Since \(y = y' + y''\) and the two contributions are independent:
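The variance bookkeeping for this derivation can be sketched as follows. This is a reconstruction under stated assumptions rather than a quotation of [Yuan2023]: zero-mean independent weights, a unit second moment for the activations entering each convolution, and (after the first gradient step) the two halves of each pair treated as independent.

```latex
\begin{align*}
\operatorname{Var}[u']  &= \alpha^2\, C_{\text{in}} k^2\, \operatorname{Var}[W_1^{(t)}], &
\operatorname{Var}[u''] &= C_{\text{in}} k^2\, \operatorname{Var}[V], \\
\operatorname{Var}[y']  &= \beta^2\, h_t\, k^2\, \operatorname{Var}[W_2^{(t)}], &
\operatorname{Var}[y'']_{\text{1st step}} &\approx 2\Delta h\, k^2\, \operatorname{Var}[Z], \\
\operatorname{Var}[y]   &= \operatorname{Var}[y'] + \operatorname{Var}[y''].
\end{align*}
```

If the new weights are drawn with \(\operatorname{Var}[Z] = 1/(h_{t+1} k^2)\), which is an assumption here, the new pathway contributes \(2\Delta h / h_{t+1}\) after the first step. That value is consistent with the per-strategy results listed in the summary.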
Strategy A#
With \(\alpha = 1\) and \(\beta = \sqrt{h_t/h_{t+1}}\):
\[
\operatorname{Var}[u'] = C_{\text{in}} k^2\, \operatorname{Var}[W_1^{(t)}]
\]
This equals 1 only if \(\operatorname{Var}[W_1^{(t)}] = 1/(C_{\text{in}} k^2)\) already holds. Strategy A preserves variance across growth steps only if the weights already have the correct variance.
Strategy B#
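By construction:
\[
\operatorname{Var}[\alpha\, W_1^{(t)}] = \frac{1}{C_{\text{in}} k^2}, \qquad
\operatorname{Var}[\beta\, W_2^{(t)}] = \frac{1}{h_t\, k^2}
\]
At init \(\operatorname{Var}[y] = 1\) (exact). After one gradient step:
\[
\operatorname{Var}[y]_{\text{1st step}} = 1 + \frac{2\Delta h}{h_{t+1}} > 1
\]
The excess shrinks as \(\Delta h / h_{t+1} \to 0\).
Merged weight variances:
\[
\operatorname{Var}[W_1^{(t+1)}] = \frac{1}{C_{\text{in}} k^2}, \qquad
\operatorname{Var}[W_2^{(t+1)}] \approx \frac{1}{h_{t+1}\, k^2}
\]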
Strategy C#
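By construction:
\[
\operatorname{Var}[\alpha\, W_1^{(t)}] = \frac{1}{C_{\text{in}} k^2}, \qquad
\operatorname{Var}[\beta\, W_2^{(t)}] = \frac{1}{h_{t+1}\, k^2}
\]
At init \(\operatorname{Var}[y] = h_t/h_{t+1} < 1\). After one gradient step:
\[
\operatorname{Var}[y]_{\text{1st step}} = \frac{h_t}{h_{t+1}} + \frac{2\Delta h}{h_{t+1}} = 1
\]
Merged weight variances:
\[
\operatorname{Var}[W_1^{(t+1)}] = \frac{1}{C_{\text{in}} k^2}, \qquad
\operatorname{Var}[W_2^{(t+1)}] = \frac{1}{h_{t+1}\, k^2}
\]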
Summary: resulting variances after growth#
Weight variances:
| Strategy | \(\operatorname{Var}[W_1^{(t+1)}]\) | \(\operatorname{Var}[W_2^{(t+1)}]\) |
|---|---|---|
| A | depends on prior \(\operatorname{Var}[W_1^{(t)}]\) | depends on prior \(\operatorname{Var}[W_2^{(t)}]\) |
| B | \(= 1/(C_{\text{in}} k^2)\) (exact) | \(\approx 1/(h_{t+1} k^2)\) (approximate) |
| C | \(= 1/(C_{\text{in}} k^2)\) (exact) | \(= 1/(h_{t+1} k^2)\) (exact) |
Activation variances:
| Strategy | \(\operatorname{Var}[y]_{\text{init}}\) | \(\operatorname{Var}[y]_{\text{1st step}}\) | Trade-off |
|---|---|---|---|
| A | depends on prior weights | depends on prior weights | no correction |
| B | \(1\) (exact) | \(1 + 2\Delta h / h_{t+1}\) (> 1) | stable at init, slight excess after |
| C | \(h_t / h_{t+1}\) (< 1) | \(1\) (exact) | small init deficit, exact after 1 step |
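The entries for Strategies B and C can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, weights are taken as zero-mean, activations entering Conv2 are given unit second moment, and the new weights \(Z\) are assumed to be initialized with variance \(1/(h_{t+1} k^2)\).

```python
def variances_after_growth(strategy, h_old, delta_h, k=3):
    """Return (Var[y] at init, Var[y] after 1st step, merged Var[W2] / target).

    Hypothetical helper; see the assumptions stated in the text above.
    """
    h_new = h_old + 2 * delta_h
    target = 1.0 / (h_new * k * k)   # 1 / fan_in_new of Conv2
    var_z = target                   # assumed init variance of Z
    if strategy == "vt_constraint_old_shape":    # Strategy B
        var_old = 1.0 / (h_old * k * k)
    elif strategy == "vt_constraint_new_shape":  # Strategy C
        var_old = target
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    var_y_init = h_old * k * k * var_old                   # old pathway only
    var_y_step = var_y_init + 2 * delta_h * k * k * var_z  # plus new pathway
    # Pooled variance of [beta*W2, Z, -Z] along the hidden dimension.
    merged = (h_old * var_old + 2 * delta_h * var_z) / h_new
    return var_y_init, var_y_step, merged / target


b = variances_after_growth("vt_constraint_old_shape", h_old=64, delta_h=8)
c = variances_after_growth("vt_constraint_new_shape", h_old=64, delta_h=8)
```

For \(h_t = 64\) and \(\Delta h = 8\) (so \(h_{t+1} = 80\)), Strategy B starts at 1 and rises to 1.2, while Strategy C starts at 0.8 and reaches 1. Under these assumptions the merged Conv2 variance sits on the new-shape target for C and 20% above it for B, which is the "approximate" entry in the weight table.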
Implementation#
The variance-transfer features are exposed through
create_layer_extensions() via
two parameters:
rescaling: one of None, "default_vt", "vt_constraint_old_shape", "vt_constraint_new_shape"
neuron_pairing: one of None, "vv_z_negz"
These can also be called independently as standalone methods for the FOGRO
growth path, where extensions are created by
compute_optimal_added_parameters and trimmed by
sub_select_optimal_added_parameters before rescaling and pairing are
applied:
All parameters default to None, preserving full backward compatibility.