
Spatial Weight Prior for Sparsity and Other Benefits


Hypothesis

I had this idea and thought I'd play around with a few versions of it. This is a compiled report, put together (with extensive oversight), covering all the tests and how the idea evolved. Pretty interesting.

Idea: Add a fixed (non-trained) spatial weight matrix that biases connections based on neuron position. Nearby neurons connect more strongly than distant ones.

W_effective = W_learned * spatial_kernel

Why interesting:
- Neural networks have permutation symmetry, so neurons can be reordered without changing the function
- This makes comparing networks trained from different inits hard (need to find the right permutation)
- Adding a spatial prior breaks this symmetry so that position now matters
- Networks might converge to more similar internal structures

Expected effects:
- Breaks permutation symmetry, shuffling neurons now hurts performance
- Encourages local processing (like a soft convolution)
- Networks trained from different inits might have more comparable representations
- Possible cost: reduced capacity / slower learning?

Kernels to test (a minimal layer sketch follows the list):
- none — standard fully connected (baseline)
- linear — linear decay with distance: 1 - |i - j| / max_dist
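To make this concrete, here is a minimal sketch of such a layer in PyTorch (the log's MLPs use SpatialLinear layers; the exact layer code isn't shown, so the build_linear_kernel helper and the [0, 1] position normalization are my assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

def build_linear_kernel(out_features, in_features):
    # Linear-decay kernel: 1 - |i - j| / max_dist, with neuron positions
    # rescaled to [0, 1] so layers of different widths are comparable.
    rows = torch.linspace(0, 1, out_features).unsqueeze(1)
    cols = torch.linspace(0, 1, in_features).unsqueeze(0)
    return 1.0 - (rows - cols).abs()

class SpatialLinear(nn.Module):
    def __init__(self, in_features, out_features, kernel=None):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        if kernel is None:                       # "none" kernel = all ones (baseline)
            kernel = torch.ones(out_features, in_features)
        self.register_buffer("kernel", kernel)   # fixed, not trained

    def forward(self, x):
        w_effective = self.weight * self.kernel  # W_effective = W_learned * spatial_kernel
        return F.linear(x, w_effective, self.bias)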


Log

v1 - Linear vs None Kernel

Date: 2025-12-29

Setup:
- MNIST, simple MLP with SpatialLinear layers
- Train 2 networks per kernel type (different seeds)
- Compare: accuracy + weight similarity between the two networks

Results:

Kernel    Avg Accuracy    Weight Similarity (cosine)
none      97.69%          0.018
linear    97.56%          0.007

Observations:
- Accuracy nearly identical — spatial prior doesn't hurt learning
- Weight similarity near ZERO for both — networks learn completely different weights
- The spatial prior did NOT increase weight convergence

Analysis:

The hypothesis was wrong. Breaking permutation symmetry doesn't make networks converge to similar solutions because:

  1. Many local solutions exist — Even with locality constraints, there are many ways to learn "detect edges in this region"
  2. Symmetry isn't the main problem — The issue isn't that neurons are permuted; it's that the loss landscape has many equivalent minima
  3. Weights ≠ representations — Networks might learn similar functions despite different weights

Key insight: The spatial kernel constrains how information flows (locality), but not what patterns to learn. Two networks can both learn "local features" but with completely different local features.


v2 - Spatial as Magnitude Normalizer

Date: 2025-12-29

Idea: What if spatial kernel defines MAGNITUDE and learned weights define DIRECTION?

W_direction = W_learned / ||W_learned||  # per row
W_effective = spatial_kernel * W_direction * scale
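A minimal sketch of this reformulation, assuming per-row L2 normalization and a single global scale factor (the log doesn't say how scale was chosen):

import torch

def effective_weight_v2(w_learned, spatial_kernel, scale=1.0):
    # Learned weights supply only a direction (unit-norm rows);
    # the fixed spatial kernel supplies the magnitude.
    w_direction = w_learned / w_learned.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return spatial_kernel * w_direction * scale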

Results:

Kernel    Avg Accuracy    Weight Similarity
none      97.43%          0.048
linear    97.22%          0.038

Observations:
- Slightly increased weight similarity (~4-5% vs ~1-2% in v1)
- Small accuracy cost
- More constraint → slightly more similar structures


v3 - Multiple Kernel Shapes

Date: 2025-12-29

Setup: Test different spatial kernels with magnitude normalization on MNIST.

Kernels (sketched in code below):
- none — all ones (baseline)
- linear — 0.1 + 0.9 * (1 - dist)
- exponential — 0.05 + 0.95 * exp(-3 * dist)
- gaussian — 0.05 + 0.95 * exp(-dist² / 0.18)
- inverse — 0.1 + 0.9 * dist (far = strong, nearby = weak)
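A sketch of these kernel shapes, applied to a distance matrix dist normalized to [0, 1] (built the same way as in the SpatialLinear sketch above); the make_kernel helper is illustrative:

import torch

def make_kernel(dist, kind):
    if kind == "none":
        return torch.ones_like(dist)
    if kind == "linear":
        return 0.1 + 0.9 * (1.0 - dist)
    if kind == "exponential":
        return 0.05 + 0.95 * torch.exp(-3.0 * dist)
    if kind == "gaussian":
        return 0.05 + 0.95 * torch.exp(-dist ** 2 / 0.18)
    if kind == "inverse":
        return 0.1 + 0.9 * dist   # far = strong, nearby = weak
    raise ValueError(f"unknown kernel: {kind}")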

Results:

Kernel         Avg Accuracy    Weight Similarity
inverse        97.66%          0.041
none           97.43%          0.048
linear         97.22%          0.038
exponential    97.20%          0.020
gaussian       97.09%          0.018

Observations:
- Inverse kernel performed best on MNIST
- Sharper locality (exponential, gaussian) performed worse
- This is MNIST-specific — different pattern on CIFAR-10


v4 - Annealed Kernels

Date: 2025-12-29

Idea: What if the spatial kernel transitions over training time?

Modes (see the interpolation sketch below):
- none_to_linear — Start unconstrained, gradually add locality
- linear_to_none — Start with locality, gradually relax
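One way to implement the annealing, assuming a linear interpolation between the start and end kernels over training (the exact schedule isn't recorded in the log):

import torch

def annealed_kernel(start, end, epoch, total_epochs):
    # alpha goes 0 -> 1 over training; the kernel morphs from `start` to `end`.
    alpha = min(epoch / max(total_epochs - 1, 1), 1.0)
    return (1.0 - alpha) * start + alpha * end

# none_to_linear: start = torch.ones_like(linear_kernel), end = linear_kernel
# linear_to_none: start = linear_kernel, end = torch.ones_like(linear_kernel)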

Results (MNIST, 15 epochs):

Mode               Avg Accuracy    Weight Similarity
none_to_linear     96.62%          0.057
linear_to_none     96.39%          0.051
none (static)      96.00%          0.052
linear (static)    95.88%          0.044

Observations:
- Both annealing modes beat their static counterparts
- none_to_linear (add structure) slightly better than linear_to_none (remove structure)

I want to explore this more later... but for now I conceptually prefer going from none to the structured kernel.


v5 - Comprehensive Test on CIFAR-10

Date: 2025-12-30

Setup: Harder problem (CIFAR-10), all kernel variants, 12 epochs, 2 seeds.

Note: Removed magnitude normalization (was causing issues with large input sizes). Went back to simple multiplication: W_effective = W_learned * kernel.

Results:

Mode                   Avg Accuracy    Weight Similarity
exponential            50.53%          0.0424
none_to_exponential    50.16%          0.0405
linear                 49.65%          0.0447
none_to_linear         49.63%          0.0444
linear_to_none         48.63%          0.0464
none                   48.16%          0.0495
exponential_to_none    47.83%          0.0503

Observations:
- All spatial kernels beat baseline none on CIFAR-10
- Exponential (sharper locality) outperforms linear (softer locality)
- Pattern: none_to_X helps, X_to_none hurts
- Opposite of MNIST, where inverse (anti-locality) won

Open questions:
- Why does MNIST prefer anti-locality while CIFAR-10 prefers strong locality?
- Is the annealing benefit from curriculum-style learning or something else?
- Would longer training change the rankings?


v6 - Shuffled vs Structured Kernel

Date: 2025-12-30

Question: Does the benefit come from having any consistent structure, or specifically from index-aligned locality?

Setup:
- Take the best kernel from v5 (exponential)
- Create "shuffled" version: same values, randomly permuted positions (see the sketch below)
- Two different shuffle seeds (1000, 2000) to check if the specific pattern matters
- CIFAR-10, 12 epochs, 3 seeds per config
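A sketch of how the shuffled control can be built (the shuffle_kernel helper is illustrative): keep the exponential kernel's values but randomly permute their positions, so the same value distribution lands on arbitrary (i, j) pairs instead of index-aligned neighbors.

import torch

def shuffle_kernel(kernel, seed):
    # Same values as the structured kernel, randomly re-assigned to positions.
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(kernel.numel(), generator=g)
    return kernel.flatten()[perm].reshape(kernel.shape)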

Results:

Config                       Accuracy          Weight Similarity
none (baseline)              48.00% ± 0.46%    0.0486
exponential (structured)     50.48% ± 0.35%    0.0256
exponential_shuffled_1000    50.26% ± 0.44%    0.0237
exponential_shuffled_2000    50.04% ± 0.67%    0.0015

Observations:
- Shuffled kernels beat baseline (~2% improvement)
- Shuffled ≈ structured exponential (only ~0.3% gap, within noise)
- Both shuffle seeds performed similarly
- The specific random pattern doesn't seem to matter

What this suggests (tentatively):
- The benefit is not specifically from index-aligned locality
- Having any consistent heterogeneous connectivity pattern helps
- The structured vs shuffled distinction may not be meaningful for performance

What we can't conclude yet:
- Why this helps (regularization? optimization landscape? something else?)
- Whether this generalizes to other architectures or tasks
- Whether the effect persists with longer training


v7 - Multi-Task Weight Sharing with Spatial Scaffolding

Date: 2025-12-30

Question: Can decaying spatial kernels act as "scaffolding" to help one set of weights learn two completely different tasks better than standard multi-task learning?

Setup:
- Two very different tasks: CIFAR-10 (image classification) + IMDB (sentiment analysis)
- Shared hidden layers with task-specific input/output adapters
- Three conditions:
  - separate: Independent networks (upper bound)
  - no_spatial: Standard multi-task learning, no spatial kernels
  - spatial_decay: Different spatial kernels per task, decaying to uniform (see the sketch below)
- Combined loss training (both tasks every step)
- 12 epochs, 2 seeds per config
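A sketch of the spatial_decay condition, assuming each task's kernel is linearly interpolated toward an all-ones (uniform) kernel over training; the per-task kernel assignment below is my assumption, not the exact setup:

import torch

def task_kernel(base_kernel, epoch, total_epochs):
    # Start with a task-specific kernel, decay toward uniform (all ones).
    alpha = min(epoch / max(total_epochs - 1, 1), 1.0)
    return (1.0 - alpha) * base_kernel + alpha * torch.ones_like(base_kernel)

# e.g. the CIFAR-10 branch uses the exponential kernel and the IMDB branch a
# shuffled copy, both applied to the same shared weights:
#   W_eff_task = W_shared * task_kernel(task_specific_kernel, epoch, total_epochs)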

Results:

Model            CIFAR-10    IMDB
separate         46.92%      83.29%
no_spatial       45.94%      82.40%
spatial_decay    45.25%      82.61%

Observations:
- Both multi-task approaches achieved ~95-99% of separate network performance
- spatial_decay performed about the same as no_spatial
- No clear benefit from spatial scaffolding in this setup

Honest Assessment:

The finding that "shared weights can learn both tasks" is not novel — this is standard multi-task learning. What we were testing was whether spatial kernel scaffolding would improve over standard multi-task learning.

Answer: Not clearly.

The hypothesis was: task-specific spatial kernels during training → better task separation → decaying to uniform allows convergence → scaffolding helps.

The result: spatial scaffolding performed similarly to standard multi-task (no_spatial). The scaffolding didn't provide clear benefits.

Possible reasons:

  1. 12 epochs isn't enough to see the benefit
  2. The decay schedule was wrong (linear may not be optimal)
  3. The architecture was too simple
  4. Spatial scaffolding just doesn't help for this problem
  5. The kernel choices (exponential vs shuffled) weren't different enough

What remains interesting from this project:
- v5: Spatial kernels beat uniform baseline on CIFAR-10 (+2-3%)
- v6: Shuffled ≈ structured — the benefit is from heterogeneous connectivity, not specifically locality
- v7: Standard multi-task learning works well; spatial scaffolding doesn't add clear value


v8 - Capped vs Multiplicative Spatial Kernels

Date: 2025-12-30

Question: What if spatial kernels define magnitude caps instead of multipliers?

Two approaches:
- Multiply: W_effective = W_learned * spatial_kernel (previous experiments)
- Capped: W_effective = clamp(W_learned, -spatial_kernel, +spatial_kernel)

With multiplication, weak spatial values scale down both the weight AND its gradient. With capping, weak spatial values limit the allowed range, but gradients are unscaled within that range.
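A minimal sketch of the two formulations side by side (the mode names are illustrative):

import torch

def effective_weight(w_learned, kernel, mode):
    if mode == "multiply":
        # Small kernel values scale both the weight and its gradient.
        return w_learned * kernel
    if mode == "capped":
        # Weights move freely inside [-kernel, +kernel]; gradients are
        # unscaled there, but elements clipped at the boundary stop
        # receiving gradient through this path.
        return torch.clamp(w_learned, min=-kernel, max=kernel)
    raise ValueError(f"unknown mode: {mode}")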

Setup:
- CIFAR-10, 12 epochs, 3 seeds per config
- Compare both approaches with: none, exponential, exponential_shuffled

Results:

Config           Mode        Kernel                  Accuracy
mult_none        multiply    none                    48.00% ± 0.46%
mult_exp         multiply    exponential             50.48% ± 0.35%
mult_exp_shuf    multiply    exponential_shuffled    50.26% ± 0.44%
cap_none         capped      none                    48.00% ± 0.46%
cap_exp          capped      exponential             47.89% ± 0.72%
cap_exp_shuf     capped      exponential_shuffled    47.96% ± 0.66%

Observations:
- With none kernel: both approaches identical (expected — uniform kernel means no constraint)
- With spatial kernels: multiply beats capped by ~2.5%
- Capped spatial kernels don't beat baseline — they perform same as no kernel
- The benefit from spatial kernels disappears when using capping instead of multiplication

What this tells us:
- The benefit of spatial kernels is tied to the multiplicative formulation
- Hard magnitude constraints (capping) don't provide the same benefit
- Something about the multiplication approach matters beyond just limiting weight magnitudes

What we still don't know:
- Whether it's the gradient scaling, the soft constraint, or something else
- Whether different cap formulations (soft boundaries) would change results


v9 - Spatial Kernel as Decay Target

Date: 2025-12-30

Question: What if we decay weights toward spatial kernel values instead of toward zero?

Idea:
- Standard weight decay: W → 0
- This experiment: W → spatial_kernel

The spatial kernel becomes the "equilibrium" connectivity pattern. Weights can deviate but are regularized back toward it.

Setup:
- Add loss term: decay_weight * sum((W - spatial_kernel)^2) (see the sketch below)
- CIFAR-10, 12 epochs, 3 seeds per config
- decay_weight = 0.0001
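A sketch of the added regularizer, assuming it is simply summed into the task loss:

import torch

def decay_to_kernel_loss(w_learned, spatial_kernel, decay_weight=1e-4):
    # Pull each weight toward the corresponding kernel value instead of zero.
    return decay_weight * ((w_learned - spatial_kernel) ** 2).sum()

# total_loss = task_loss + decay_to_kernel_loss(layer.weight, layer.kernel)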

Results:

Config               Decay Target    Accuracy
decay_to_zero        zero            47.68% ± 0.29%
decay_to_exp         exponential     26.61% ± 0.61%
decay_to_exp_shuf    shuffled        27.18% ± 0.95%

Observations:
- Decaying toward spatial kernel values catastrophically hurts performance (~21% worse)
- Both structured and shuffled decay targets fail equally badly
- Accuracy collapses to 26-27%, much of the way down toward the 10% random-guess floor for 10 classes

Why this fails:
- The exponential kernel values are all positive (0.05 to 1.0)
- This forces weights toward positive values, but networks need both positive and negative weights
- Spatial kernels make sense as scaling factors (biasing connection strength)
- They don't make sense as target values (what the weights should be)

What this tells us:
- The benefit from v6 (multiplicative approach) comes from biasing how much connections can contribute
- Not from establishing what the actual weight values should be
- Spatial kernels are about relative influence, not absolute values


v10 - Progressive Sharpening

Date: 2025-12-31

Question: What if we start with a broad kernel and progressively sharpen it during training?

Idea:
- Use exponential kernel: kernel = 0.05 + 0.95 * exp(-k * dist)
- Anneal k from low (broad) to high (sharp) over training
- Network learns with more connections early, then is forced to compress into fewer connections

Setup:
- CIFAR-10, 12 epochs, 3 seeds per config
- Compare static k=3, static k=15, and various sharpening schedules (see the sketch below)
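A sketch of the sharpening schedule, assuming a linear ramp of k over epochs (the log only gives the endpoints), plus the sparsity metric used in the tables below:

import torch

def sharpened_kernel(dist, epoch, total_epochs, k_start=3.0, k_end=15.0):
    # Rebuild the exponential kernel each epoch with a progressively larger k.
    alpha = min(epoch / max(total_epochs - 1, 1), 1.0)
    k = k_start + alpha * (k_end - k_start)
    return 0.05 + 0.95 * torch.exp(-k * dist)

def kernel_sparsity(kernel, threshold=0.1):
    # Sparsity as reported below: fraction of kernel values under the threshold.
    return (kernel < threshold).float().mean().item()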

Results:

Config           k schedule    Accuracy          Final Sparsity
static_k3        k=3           50.34% ± 0.28%    0%
sharpen_3to15    3→15          52.59% ± 0.34%    64.7%
sharpen_3to10    3→10          51.66% ± 0.41%    48.0%
sharpen_1to15    1→15          51.89% ± 0.52%    64.7%
static_k15       k=15          49.21% ± 0.39%    64.7%

(Sparsity = % of kernel values < 0.1)

Observations:
- Progressive sharpening beats both static baselines
- Best config: sharpen_3to15 at 52.59% (+2.25% over static_k3)
- Starting sharp (k=15) hurts; ending sharp helps
- Achieves ~65% effective sparsity while improving accuracy

What this suggests:
- The annealing schedule matters: broad → sharp works better than static
- Network can adapt to increasing constraints during training
- This creates a naturally sparse connectivity pattern


v11 - High Sparsity Progressive Sharpening

Date: 2025-12-31

Question: How far can we push sparsity with progressive sharpening before accuracy degrades?

Setup:
- Build on v10's best config (sharpen_3to15)
- Test higher k_end values: 30, 50, 75, 100
- Target: 95%+ sparsity

Results:

Config            k schedule    Accuracy          Final Sparsity
sharpen_3to15     3→15          52.59% ± 0.34%    64.7%
sharpen_3to30     3→30          52.91% ± 0.35%    81.4%
sharpen_3to50     3→50          53.03% ± 0.16%    88.6%
sharpen_3to75     3→75          52.51% ± 0.31%    92.3%
sharpen_3to100    3→100         52.57% ± 0.28%    94.2%

Convergence speed comparison (test accuracy by epoch):

Epoch    Baseline (none)    Exponential    Sharpen 3→50
0        40.1%              41.7%          41.4%
2        43.4%              45.6%          45.8%
4        46.2%              47.9%          48.7%
6        46.3%              48.8%          50.4%
11       48.0%              50.5%          53.0%

Observations:
- Flat accuracy-sparsity tradeoff: 94.2% sparsity with <0.1% accuracy loss vs 64.7% sparsity
- Best config: sharpen_3to50 at 53.03% with 88.6% sparsity
- Spatial kernel networks converge faster: reach 45% accuracy ~1-2 epochs earlier than baseline
- Gap widens over training: +2% at epoch 2, +5% at epoch 11


Summary of Key Results

  1. Spatial kernels beat uniform baseline on CIFAR-10 — +5% accuracy (48% → 53%)
  2. Shuffled ≈ structured kernels (v6) — The benefit comes from heterogeneous connectivity, not specifically from index-aligned locality
  3. Multiplicative formulation is essential (v8) — Capping doesn't work
  4. Spatial kernels as decay targets fail (v9) — They work as scaling factors, not target values
  5. Progressive sharpening achieves high sparsity (v10-11) — 94% sparsity with maintained accuracy
  6. Faster convergence — Spatial kernel networks reach accuracy thresholds earlier than baseline

The end (: