Spatial Weight Prior for Sparsity and Other Benefits
Hypothesis
I had this idea and thought I'd play around with a few versions of it. This is a compiled report, put together (with extensive oversight) from all the tests, showing how the idea evolved. Pretty interesting.
Idea: Add a fixed (non-trained) spatial weight matrix that biases connections based on neuron position. Nearby neurons connect more strongly than distant ones.
W_effective = W_learned * spatial_kernel
Why interesting:
- Neural networks have permutation symmetry, so neurons can be reordered without changing the function
- This makes comparing networks trained from different inits hard (need to find the right permutation)
- Adding a spatial prior breaks this symmetry so that position now matters
- Networks might converge to more similar internal structures
Expected effects:
- Breaks permutation symmetry: shuffling neurons now hurts performance
- Encourages local processing (like a soft convolution)
- Networks trained from different inits might have more comparable representations
- Possible cost: reduced capacity / slower learning?
Kernels to test:
- none — standard fully connected (baseline)
- linear — linear decay with distance: 1 - |i - j| / max_dist
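Here's a minimal sketch of what such a layer could look like (PyTorch assumed; the SpatialLinear used in the experiments below isn't shown in this log, so the names and init here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_kernel(out_features, in_features):
    """Linear kernel: 1 - |i - j| / max_dist, with neuron positions normalized to [0, 1]."""
    rows = torch.linspace(0, 1, out_features).unsqueeze(1)  # output-neuron positions
    cols = torch.linspace(0, 1, in_features).unsqueeze(0)   # input-neuron positions
    return 1.0 - (rows - cols).abs()                        # max_dist is 1 after normalization

class SpatialLinear(nn.Module):
    """Linear layer whose learned weights are scaled by a fixed (non-trained) spatial kernel."""
    def __init__(self, in_features, out_features, kernel=None):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        if kernel is None:                                   # "none" kernel: baseline, all ones
            kernel = torch.ones(out_features, in_features)
        self.register_buffer("kernel", kernel)               # fixed buffer, never trained

    def forward(self, x):
        # W_effective = W_learned * spatial_kernel (elementwise)
        return F.linear(x, self.weight * self.kernel, self.bias)
```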
Log
v1 - Linear vs None Kernel
Date: 2025-12-29
Setup:
- MNIST, simple MLP with SpatialLinear layers
- Train 2 networks per kernel type (different seeds)
- Compare: accuracy + weight similarity between the two networks
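The weight-similarity metric isn't written out anywhere in this log; a plausible reconstruction, assuming it's the cosine between the two models' flattened weights:

```python
import torch
import torch.nn.functional as F

def weight_cosine_similarity(model_a, model_b):
    """Cosine similarity between the flattened, concatenated weights of two models."""
    wa = torch.cat([p.detach().flatten() for p in model_a.parameters()])
    wb = torch.cat([p.detach().flatten() for p in model_b.parameters()])
    return F.cosine_similarity(wa, wb, dim=0).item()
```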
Results:
| Kernel | Avg Accuracy | Weight Similarity (cosine) |
|---|---|---|
| none | 97.69% | 0.018 |
| linear | 97.56% | 0.007 |
Observations:
- Accuracy nearly identical — spatial prior doesn't hurt learning
- Weight similarity near ZERO for both — networks learn completely different weights
- The spatial prior did NOT increase weight convergence
Analysis:
The hypothesis was wrong. Breaking permutation symmetry doesn't make networks converge to similar solutions because:
- Many local solutions exist — Even with locality constraints, there are many ways to learn "detect edges in this region"
- Symmetry isn't the main problem — The issue isn't that neurons are permuted; it's that the loss landscape has many equivalent minima
- Weights ≠ representations — Networks might learn similar functions despite different weights
Key insight: The spatial kernel constrains how information flows (locality), but not what patterns to learn. Two networks can both learn "local features" but with completely different local features.
v2 - Spatial as Magnitude Normalizer
Date: 2025-12-29
Idea: What if the spatial kernel defines MAGNITUDE and the learned weights define DIRECTION?
W_direction = W_learned / ||W_learned|| # per row
W_effective = spatial_kernel * W_direction * scale
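A sketch of the v2 effective weight (the per-row L2 norm and the single scale factor are my assumptions):

```python
import torch

def v2_effective_weight(w_learned, kernel, scale=1.0):
    # Direction comes from the learned weights (normalized per output row);
    # magnitude comes from the fixed spatial kernel.
    w_direction = w_learned / (w_learned.norm(dim=1, keepdim=True) + 1e-8)
    return kernel * w_direction * scale
```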
Results:
| Kernel | Avg Accuracy | Weight Similarity |
|---|---|---|
| none | 97.43% | 0.048 |
| linear | 97.22% | 0.038 |
Observations:
- Slightly higher weight similarity (cosine ~0.04-0.05 vs ~0.01-0.02 in v1)
- Small accuracy cost
- More constraint → slightly more similar structures
v3 - Multiple Kernel Shapes
Date: 2025-12-29
Setup: Test different spatial kernels with magnitude normalization on MNIST.
Kernels:
- none — all ones (baseline)
- linear — 0.1 + 0.9 * (1 - dist)
- exponential — 0.05 + 0.95 * exp(-3 * dist)
- gaussian — 0.05 + 0.95 * exp(-dist² / 0.18)
- inverse — 0.1 + 0.9 * dist (far = strong, nearby = weak)
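Written out as code (PyTorch assumed; dist is the normalized |i - j| distance matrix in [0, 1]):

```python
import torch

def make_kernel(name, dist):
    """The v3 kernel shapes as functions of normalized distance in [0, 1]."""
    if name == "none":
        return torch.ones_like(dist)
    if name == "linear":
        return 0.1 + 0.9 * (1 - dist)
    if name == "exponential":
        return 0.05 + 0.95 * torch.exp(-3 * dist)
    if name == "gaussian":
        return 0.05 + 0.95 * torch.exp(-dist ** 2 / 0.18)
    if name == "inverse":
        return 0.1 + 0.9 * dist      # far = strong, nearby = weak
    raise ValueError(f"unknown kernel: {name}")
```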
Results:
| Kernel | Avg Accuracy | Weight Similarity |
|---|---|---|
| inverse | 97.66% | 0.041 |
| none | 97.43% | 0.048 |
| linear | 97.22% | 0.038 |
| exponential | 97.20% | 0.020 |
| gaussian | 97.09% | 0.018 |
Observations:
- Inverse kernel performed best on MNIST
- Sharper locality (exponential, gaussian) performed worse
- This is MNIST-specific — different pattern on CIFAR-10
v4 - Annealed Kernels
Date: 2025-12-29
Idea: What if the spatial kernel transitions over training time?
Modes:
- none_to_linear — Start unconstrained, gradually add locality
- linear_to_none — Start with locality, gradually relax
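One simple way to implement the transition; linear interpolation over epochs is my assumption, the log doesn't specify the exact schedule:

```python
def annealed_kernel(kernel_start, kernel_end, epoch, total_epochs):
    """Blend from kernel_start to kernel_end as training progresses.

    e.g. none_to_linear: kernel_start = all ones, kernel_end = the linear kernel.
    """
    alpha = epoch / max(total_epochs - 1, 1)   # 0 at the first epoch, 1 at the last
    return (1 - alpha) * kernel_start + alpha * kernel_end
```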
Results (MNIST, 15 epochs):
| Mode | Avg Accuracy | Weight Similarity |
|---|---|---|
| none_to_linear | 96.62% | 0.057 |
| linear_to_none | 96.39% | 0.051 |
| none (static) | 96.00% | 0.052 |
| linear (static) | 95.88% | 0.044 |
Observations:
- Both annealing modes beat their static counterparts
- none_to_linear (add structure) slightly better than linear_to_none (remove structure)
I want to explore this more later... but for now, conceptually I like going from no constraint to the kernel more than the reverse.
v5 - Comprehensive Test on CIFAR-10
Date: 2025-12-30
Setup: Harder problem (CIFAR-10), all kernel variants, 12 epochs, 2 seeds.
Note: Removed magnitude normalization (was causing issues with large input sizes). Went back to simple multiplication: W_effective = W_learned * kernel.
Results:
| Mode | Avg Accuracy | Weight Similarity |
|---|---|---|
| exponential | 50.53% | 0.0424 |
| none_to_exponential | 50.16% | 0.0405 |
| linear | 49.65% | 0.0447 |
| none_to_linear | 49.63% | 0.0444 |
| linear_to_none | 48.63% | 0.0464 |
| none | 48.16% | 0.0495 |
| exponential_to_none | 47.83% | 0.0503 |
Observations:
- All spatial kernels beat baseline none on CIFAR-10
- Exponential (sharper locality) outperforms linear (softer locality)
- Pattern: none_to_X helps, X_to_none hurts
- Opposite of MNIST where inverse (anti-locality) won
Open questions:
- Why does MNIST prefer anti-locality while CIFAR-10 prefers strong locality?
- Is the annealing benefit from curriculum-style learning or something else?
- Would longer training change the rankings?
v6 - Shuffled vs Structured Kernel
Date: 2025-12-30
Question: Does the benefit come from having any consistent structure, or specifically from index-aligned locality?
Setup:
- Take the best kernel from v5 (exponential)
- Create "shuffled" version: same values, randomly permuted positions
- Two different shuffle seeds (1000, 2000) to check if specific pattern matters
- CIFAR-10, 12 epochs, 3 seeds per config
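A sketch of how the shuffled control could be built (the exact permutation scheme is my guess):

```python
import torch

def shuffle_kernel(kernel, seed):
    """Same kernel values, randomly permuted positions (seed = 1000 or 2000 above)."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(kernel.numel(), generator=g)
    return kernel.flatten()[perm].reshape(kernel.shape)
```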
Results:
| Config | Accuracy | Weight Similarity |
|---|---|---|
| none (baseline) | 48.00% ± 0.46% | 0.0486 |
| exponential (structured) | 50.48% ± 0.35% | 0.0256 |
| exponential_shuffled_1000 | 50.26% ± 0.44% | 0.0237 |
| exponential_shuffled_2000 | 50.04% ± 0.67% | 0.0015 |
Observations:
- Shuffled kernels beat baseline (~2% improvement)
- Shuffled ≈ structured exponential (only ~0.3% gap, within noise)
- Both shuffle seeds performed similarly
- The specific random pattern doesn't seem to matter

What this suggests (tentatively):
- The benefit is not specifically from index-aligned locality
- Having any consistent heterogeneous connectivity pattern helps
- The structured vs shuffled distinction may not be meaningful for performance

What we can't conclude yet:
- Why this helps (regularization? optimization landscape? something else?)
- Whether this generalizes to other architectures or tasks
- Whether the effect persists with longer training
v7 - Multi-Task Weight Sharing with Spatial Scaffolding
Date: 2025-12-30
Question: Can decaying spatial kernels act as "scaffolding" to help one set of weights learn two completely different tasks better than standard multi-task learning?
Setup:
- Two very different tasks: CIFAR-10 (image classification) + IMDB (sentiment analysis)
- Shared hidden layers with task-specific input/output adapters
- Three conditions:
  - separate: Independent networks (upper bound)
  - no_spatial: Standard multi-task learning, no spatial kernels
  - spatial_decay: Different spatial kernels per task, decaying to uniform
- Combined loss training (both tasks every step)
- 12 epochs, 2 seeds per config
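Roughly, the spatial_decay condition could look like the sketch below; which kernel each task gets, and the linear decay to uniform, are my assumptions:

```python
import torch

def task_effective_weight(w_shared, task_kernel, epoch, total_epochs):
    """Each task sees the shared weights through its own kernel, decaying to uniform."""
    alpha = epoch / max(total_epochs - 1, 1)                         # 0 -> 1 over training
    kernel_now = (1 - alpha) * task_kernel + alpha * torch.ones_like(task_kernel)
    return w_shared * kernel_now
```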
Results:
| Model | CIFAR-10 Accuracy | IMDB Accuracy |
|---|---|---|
| separate | 46.92% | 83.29% |
| no_spatial | 45.94% | 82.40% |
| spatial_decay | 45.25% | 82.61% |
Observations:
- Both multi-task approaches achieved ~95-99% of separate network performance
- spatial_decay performed about the same as no_spatial
- No clear benefit from spatial scaffolding in this setup
Honest Assessment:
The finding that "shared weights can learn both tasks" is not novel — this is standard multi-task learning. What we were testing was whether spatial kernel scaffolding would improve over standard multi-task learning.
Answer: Not clearly.
The hypothesis was: task-specific spatial kernels during training → better task separation → decaying to uniform allows convergence → scaffolding helps.
The result: spatial scaffolding performed similarly to standard multi-task (no_spatial). The scaffolding didn't provide clear benefits.
Possible reasons:
1. 12 epochs isn't enough to see the benefit
2. The decay schedule was wrong (linear may not be optimal)
3. The architecture was too simple
4. Spatial scaffolding just doesn't help for this problem
5. The kernel choices (exponential vs shuffled) weren't different enough
What remains interesting from this project:
- v5: Spatial kernels beat uniform baseline on CIFAR-10 (+2-3%)
- v6: Shuffled ≈ structured — the benefit is from heterogeneous connectivity, not specifically locality
- v7: Standard multi-task learning works well; spatial scaffolding doesn't add clear value
v8 - Capped vs Multiplicative Spatial Kernels
Date: 2025-12-30
Question: What if spatial kernels define magnitude caps instead of multipliers?
Two approaches:
- Multiply: W_effective = W_learned * spatial_kernel (previous experiments)
- Capped: W_effective = clamp(W_learned, -spatial_kernel, +spatial_kernel)
With multiplication, weak spatial values scale down both the weight AND its gradient. With capping, weak spatial values limit the allowed range, but gradients are unscaled within that range.
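The two formulations side by side, as a sketch (PyTorch assumed):

```python
import torch

def effective_weight(w_learned, kernel, mode):
    if mode == "multiply":
        # scales both the weight and, via the chain rule, its gradient
        return w_learned * kernel
    if mode == "capped":
        # limits the allowed range; the gradient passes through unscaled
        # wherever the weight sits strictly inside [-kernel, +kernel]
        return torch.clamp(w_learned, -kernel, kernel)
    raise ValueError(f"unknown mode: {mode}")
```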
Setup:
- CIFAR-10, 12 epochs, 3 seeds per config
- Compare both approaches with: none, exponential, exponential_shuffled
Results:
| Config | Mode | Kernel | Accuracy |
|---|---|---|---|
| mult_none | multiply | none | 48.00% ± 0.46% |
| mult_exp | multiply | exponential | 50.48% ± 0.35% |
| mult_exp_shuf | multiply | exponential_shuffled | 50.26% ± 0.44% |
| cap_none | capped | none | 48.00% ± 0.46% |
| cap_exp | capped | exponential | 47.89% ± 0.72% |
| cap_exp_shuf | capped | exponential_shuffled | 47.96% ± 0.66% |
Observations:
- With none kernel: both approaches identical (expected — uniform kernel means no constraint)
- With spatial kernels: multiply beats capped by ~2.5%
- Capped spatial kernels don't beat baseline — they perform same as no kernel
- The benefit from spatial kernels disappears when using capping instead of multiplication
What this tells us:
- The benefit of spatial kernels is tied to the multiplicative formulation
- Hard magnitude constraints (capping) don't provide the same benefit
- Something about the multiplication approach matters beyond just limiting weight magnitudes

What we still don't know:
- Whether it's the gradient scaling, the soft constraint, or something else
- Whether different cap formulations (soft boundaries) would change results
v9 - Spatial Kernel as Decay Target
Date: 2025-12-30
Question: What if we decay weights toward spatial kernel values instead of toward zero?
Idea:
- Standard weight decay: W → 0
- This experiment: W → spatial_kernel
The spatial kernel becomes the "equilibrium" connectivity pattern. Weights can deviate but are regularized back toward it.
Setup:
- Add loss term: decay_weight * sum((W - spatial_kernel)^2)
- CIFAR-10, 12 epochs, 3 seeds per config
- decay_weight = 0.0001
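The regularizer as a sketch, reusing the SpatialLinear layer from the v1 sketch above; it would be added to the task loss each step:

```python
def spatial_decay_loss(model, decay_weight=0.0001):
    """Penalize squared distance between learned weights and their kernel values."""
    reg = 0.0
    for module in model.modules():
        if isinstance(module, SpatialLinear):
            reg = reg + ((module.weight - module.kernel) ** 2).sum()
    return decay_weight * reg
```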
Results:
| Config | Decay Target | Accuracy |
|---|---|---|
| decay_to_zero | zero | 47.68% ± 0.29% |
| decay_to_exp | exponential | 26.61% ± 0.61% |
| decay_to_exp_shuf | shuffled | 27.18% ± 0.95% |
Observations:
- Decaying toward spatial kernel values catastrophically hurts performance (~21 points worse)
- Both structured and shuffled decay targets fail equally badly
- Accuracy collapses to ~27%, much closer to the 10% random-guess baseline (10 classes) than to standard decay

Why this fails:
- The exponential kernel values are all positive (0.05 to 1.0)
- This forces weights toward positive values, but networks need both positive and negative weights
- Spatial kernels make sense as scaling factors (biasing connection strength)
- They don't make sense as target values (what the weights should be)

What this tells us:
- The benefit from v6 (multiplicative approach) comes from biasing how much connections can contribute
- Not from establishing what the actual weight values should be
- Spatial kernels are about relative influence, not absolute values
v10 - Progressive Sharpening
Date: 2025-12-31
Question: What if we start with a broad kernel and progressively sharpen it during training?
Idea:
- Use exponential kernel: kernel = 0.05 + 0.95 * exp(-k * dist)
- Anneal k from low (broad) to high (sharp) over training
- Network learns with more connections early, then is forced to compress into fewer connections
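A sketch of the sharpening schedule (linear annealing of k over epochs is my assumption):

```python
import torch

def sharpening_kernel(dist, epoch, total_epochs, k_start=3.0, k_end=15.0):
    """Exponential kernel whose sharpness k grows from k_start to k_end over training."""
    alpha = epoch / max(total_epochs - 1, 1)
    k = k_start + alpha * (k_end - k_start)      # broad early, sharp late
    return 0.05 + 0.95 * torch.exp(-k * dist)
```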
Setup:
- CIFAR-10, 12 epochs, 3 seeds per config
- Compare static k=3, static k=15, and various sharpening schedules
Results:
| Config | k schedule | Accuracy | Final Sparsity |
|---|---|---|---|
| static_k3 | k=3 | 50.34% ± 0.28% | 0% |
| sharpen_3to15 | 3→15 | 52.59% ± 0.34% | 64.7% |
| sharpen_3to10 | 3→10 | 51.66% ± 0.41% | 48.0% |
| sharpen_1to15 | 1→15 | 51.89% ± 0.52% | 64.7% |
| static_k15 | k=15 | 49.21% ± 0.39% | 64.7% |
(Sparsity = % of kernel values < 0.1)
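For reference, that measure is just:

```python
def kernel_sparsity(kernel, threshold=0.1):
    """Fraction of kernel values below the threshold."""
    return (kernel < threshold).float().mean().item()
```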
Observations:
- Progressive sharpening beats both static baselines
- Best config: sharpen_3to15 at 52.59% (+2.25% over static_k3)
- Starting sharp (k=15) hurts; ending sharp helps
- Achieves ~65% effective sparsity while improving accuracy

What this suggests:
- The annealing schedule matters: broad → sharp works better than static
- Network can adapt to increasing constraints during training
- This creates a naturally sparse connectivity pattern
v11 - High Sparsity Progressive Sharpening
Date: 2025-12-31
Question: How far can we push sparsity with progressive sharpening before accuracy degrades?
Setup:
- Build on v10's best config (sharpen_3to15)
- Test higher k_end values: 30, 50, 75, 100
- Target: 95%+ sparsity
Results:
| Config | k schedule | Accuracy | Final Sparsity |
|---|---|---|---|
| sharpen_3to15 | 3→15 | 52.59% ± 0.34% | 64.7% |
| sharpen_3to30 | 3→30 | 52.91% ± 0.35% | 81.4% |
| sharpen_3to50 | 3→50 | 53.03% ± 0.16% | 88.6% |
| sharpen_3to75 | 3→75 | 52.51% ± 0.31% | 92.3% |
| sharpen_3to100 | 3→100 | 52.57% ± 0.28% | 94.2% |
Convergence speed comparison (test accuracy by epoch):
| Epoch | Baseline (none) | Exponential | Sharpen 3→50 |
|---|---|---|---|
| 0 | 40.1% | 41.7% | 41.4% |
| 2 | 43.4% | 45.6% | 45.8% |
| 4 | 46.2% | 47.9% | 48.7% |
| 6 | 46.3% | 48.8% | 50.4% |
| 11 | 48.0% | 50.5% | 53.0% |
Observations:
- Flat accuracy-sparsity tradeoff: 94.2% sparsity with <0.1% accuracy loss vs 64.7% sparsity
- Best config: sharpen_3to50 at 53.03% with 88.6% sparsity
- Spatial kernel networks converge faster: reach 45% accuracy ~1-2 epochs earlier than baseline
- Gap widens over training: +2% at epoch 2, +5% at epoch 11
Summary of Key Results
- Spatial kernels beat uniform baseline on CIFAR-10 — +5% accuracy (48% → 53%)
- Shuffled ≈ structured kernels (v6) — The benefit comes from heterogeneous connectivity, not specifically from index-aligned locality
- Multiplicative formulation is essential (v8) — Capping doesn't work
- Spatial kernels as decay targets fail (v9) — They work as scaling factors, not target values
- Progressive sharpening achieves high sparsity (v10-11) — 94% sparsity with maintained accuracy
- Faster convergence — Spatial kernel networks reach accuracy thresholds earlier than baseline
The end (: