
Rebel AdamW Optimizer

[email protected]

rebel-adamw

Hypothesis

Idea: Modify AdamW so that a random percentage of parameters move opposite to what the gradient suggests each step. Call these "rebel" parameters.

Why it might be interesting:
- Acts as structured exploration noise - some weights actively probe "wrong" directions
- Could help escape sharp local minima (similar intuition to sharpness-aware minimization)
- Might regularize by preventing all weights from jointly overfitting
- Like "dropout for optimization" - a random subset rebels against the gradient

Expected effects:
- Low rebel % (1-5%): Slight regularization, possibly helps generalization
- Medium rebel % (10-20%): Slower convergence but maybe better final loss
- High rebel % (30%+): Training diverges or fails to converge

Key design questions:
1. Flip the full Adam update, or just the gradient component?
2. Fixed mask per step, or re-sample each step?
3. Should rebels still update momentum buffers normally (but flip sign at application)?
4. Anneal rebel % over training?

v1 approach: Simple implementation - re-sample the rebel mask each step, flip the full update for rebels. Test on MNIST to see if it trains at all.
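The v1 step can be sketched as a thin wrapper around `torch.optim.AdamW`: let AdamW compute its usual update, then reverse that update for a freshly sampled subset of entries. The class name and the snapshot-and-diff approach are my reconstruction, not the notes' actual code:

```python
import torch

class RebelAdamW(torch.optim.AdamW):
    """Sketch of the v1 idea (assumed implementation): snapshot parameters,
    run a normal AdamW step, then flip the full update for a random
    `rebel_pct` fraction of entries, re-sampled every step."""

    def __init__(self, params, rebel_pct=0.05, **kwargs):
        super().__init__(params, **kwargs)
        self.rebel_pct = rebel_pct

    @torch.no_grad()
    def step(self, closure=None):
        params = [p for g in self.param_groups for p in g["params"]]
        before = [p.detach().clone() for p in params]    # snapshot
        loss = super().step(closure)                     # normal AdamW step
        for p, old in zip(params, before):
            delta = p - old                              # the full AdamW update
            mask = torch.rand_like(p) < self.rebel_pct   # fresh mask each step
            p[mask] = old[mask] - delta[mask]            # rebels move the other way
        return loss
```

The snapshot-diff trick sidesteps reimplementing AdamW's internals: whatever update AdamW applied (including weight decay) gets reflected for the rebels.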


Log

v1 - Initial Implementation

Date: 2025-12-29

Changes:
- Implement RebelAdamW optimizer (extends AdamW, flips the update for a random % of params)
- Simple MNIST CNN test comparing rebel_pct = 0%, 5%, 10%, 20%
- Track train/test loss and accuracy curves

Results:

| Rebel % | Final Train Acc | Final Test Acc |
|---------|-----------------|----------------|
| 0%      | 99.17%          | 99.15%         |
| 5%      | 99.14%          | 99.09%         |
| 10%     | 99.10%          | 99.10%         |
| 20%     | 99.16%          | 99.24%         |

Key observations:
1. Training still works! Even with 20% of parameters actively fighting the gradient, the model converges normally.
2. 20% rebels got the best test accuracy - 0.09% higher than baseline (99.24% vs 99.15%).
3. Test loss is smoother with rebels - the baseline (0%) shows more variance in its test loss curve.
4. No convergence slowdown - train loss curves are nearly identical across all rebel %.

Interpretation:
- The random sign-flipping acts as a form of regularization
- Similar effect to adding noise, but structured (direction reversal vs magnitude perturbation)
- MNIST may be too easy - need a harder task to see if rebels help escape local minima

Next steps:
- Try higher rebel % (30%, 40%, 50%) to find the breaking point
- Test on CIFAR-10 (harder task, might show a clearer signal)
- Try annealing rebel % down over training
- Compare to just adding Gaussian noise to updates
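One way to set up that last comparison (a hypothetical baseline, not from the notes): perturb each AdamW update with zero-mean Gaussian noise scaled to the update's own magnitude, so the perturbation size is comparable to a sign flip. The class name and `noise_scale` parameter are assumptions.

```python
import torch

class NoisyAdamW(torch.optim.AdamW):
    """Hypothetical Gaussian-noise baseline: instead of flipping updates,
    add noise with per-entry std = noise_scale * |update|."""

    def __init__(self, params, noise_scale=0.1, **kwargs):
        super().__init__(params, **kwargs)
        self.noise_scale = noise_scale

    @torch.no_grad()
    def step(self, closure=None):
        params = [p for g in self.param_groups for p in g["params"]]
        before = [p.detach().clone() for p in params]
        loss = super().step(closure)                 # normal AdamW step
        for p, old in zip(params, before):
            delta = p - old                          # the applied update
            # noise magnitude tracks the update magnitude entry-wise
            p.add_(torch.randn_like(p) * delta.abs() * self.noise_scale)
        return loss
```

If rebels are "just noise", this baseline should match their regularization effect at a comparable noise scale.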

v2 - TinyCNN + Convergence Speed Analysis

Date: 2025-12-29

Changes:
- Much smaller TinyCNN (75K params vs ~1.2M) to make the task harder
- Added 30% rebels to find the breaking point
- Track epochs to reach accuracy thresholds (97%, 97.5%, 98%, 98.5%)
- 20 epochs to see long-term behavior
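The notes don't include the TinyCNN definition, only its parameter count; the sketch below is a guess at what such a network could look like, landing near the reported ~75K parameters.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Hypothetical reconstruction of the v2 TinyCNN for MNIST.
    The real architecture isn't in the notes; this one just matches
    the reported parameter budget (~75-80K params)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 48), nn.ReLU(),
            nn.Linear(48, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```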

Results - Final Accuracies:

| Rebel % | Final Train | Final Test |
|---------|-------------|------------|
| 0%      | 95.09%      | 98.87%     |
| 5%      | 95.61%      | 98.93%     |
| 10%     | 95.70%      | 98.85%     |
| 20%     | 95.27%      | 98.90%     |
| 30%     | 94.37%      | 98.75%     |

Results - Convergence Speed (epochs to reach threshold):

| Rebel % | 97% | 97.5% | 98% | 98.5% |
|---------|-----|-------|-----|-------|
| 0%      | 2   | 2     | 4   | 7     |
| 5%      | 2   | 2     | 4   | 8     |
| 10%     | 2   | 2     | 4   | 7     |
| 20%     | 2   | 3     | 5   | 8     |
| 30%     | 2   | 4     | 6   | 12    |

Key observations:
1. 30% is too much - takes ~2x longer to reach 98.5% (epoch 12 vs 7) and has worse final accuracy.
2. 5% is the sweet spot - best final test accuracy, same convergence speed as baseline.
3. 0-10% behave similarly - no convergence penalty.
4. 20% shows a slight slowdown - 1 epoch slower to 97.5% and 98%.
5. The train accuracy gap widens - rebels hurt training more visibly with a smaller network.

Interpretation:
- With constrained capacity, rebels are more noticeable
- 5% rebels: free regularization with no speed penalty
- 30% rebels: too much noise, optimization struggles
- The sweet spot appears to be 5-10%

Next steps:
- Try annealing: start high (20%), decay to 0%
- Test per-layer rebel % (more rebels in later layers?)
- Compare to dropout (similar regularization mechanism?)
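The annealing idea could be as simple as a linear schedule. The notes don't specify a decay shape, so the function below (name and linear form both assumptions) is just one obvious candidate:

```python
def rebel_pct_schedule(step, total_steps, start=0.20, end=0.0):
    """Hypothetical linear anneal for the 'start high (20%), decay to 0%'
    idea: interpolate from `start` to `end` over `total_steps`, then hold."""
    frac = min(step / max(total_steps, 1), 1.0)  # clamp past end of training
    return start + (end - start) * frac
```

The optimizer's `rebel_pct` would then be updated from this schedule before each step (e.g. `opt.rebel_pct = rebel_pct_schedule(t, T)`).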

v3 - Fixed Rebel Mask (same weights rebel forever)

Date: 2025-12-29

Changes:
- Instead of re-sampling the rebel mask each step, create it ONCE and keep it fixed
- The same weights are always rebels (permanent gradient ascent on those params!)
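A minimal sketch of the v3 change (again a reconstruction, not the notes' actual code): draw the mask once on the first step and reuse it on every step thereafter.

```python
import torch

class FixedMaskRebelAdamW(torch.optim.AdamW):
    """Sketch of the v3 variant: the rebel mask is drawn once and then
    frozen, so the same entries fight the gradient on every step."""

    def __init__(self, params, rebel_pct=0.1, **kwargs):
        super().__init__(params, **kwargs)
        self.rebel_pct = rebel_pct
        self._masks = None  # created lazily on the first step, never re-sampled

    @torch.no_grad()
    def step(self, closure=None):
        params = [p for g in self.param_groups for p in g["params"]]
        if self._masks is None:
            self._masks = [torch.rand_like(p) < self.rebel_pct for p in params]
        before = [p.detach().clone() for p in params]
        loss = super().step(closure)  # normal AdamW step
        for p, old, mask in zip(params, before, self._masks):
            p[mask] = 2 * old[mask] - p[mask]  # reflect: old - (new - old)
        return loss
```

On a simple convex problem this already shows the failure mode: the frozen rebels walk steadily uphill while everything else converges.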

Results - Fixed Mask vs Random Mask:

| Rebel % | v2 Random Test | v3 Fixed Test | Difference |
|---------|----------------|---------------|------------|
| 0%      | 98.87%         | 98.87%        | same       |
| 5%      | 98.93%         | 98.21%        | -0.7%      |
| 10%     | 98.85%         | 96.25%        | -2.6%      |
| 20%     | 98.90%         | 91.37%        | -7.5%      |
| 30%     | 98.75%         | 32.75%        | -66% (COLLAPSED) |

Key observations:
1. A fixed mask is catastrophic - 30% fixed rebels cause training collapse (test accuracy falls to 32.75%).
2. Loss goes UP - with a fixed mask, train loss increases over time (gradient ascent wins).
3. Accuracy declines - 10% fixed: starts at 97.9%, ends at 96.2% (getting worse!).
4. The randomness is essential - random re-sampling = noise/regularization; fixed mask = permanent damage.

Interpretation:
- With a random mask, any given weight only occasionally rebels and can still learn overall
- With a fixed mask, the same weights do gradient ASCENT forever, actively maximizing loss
- This shows RebelAdamW's benefit comes from stochastic exploration, not from "some weights going backwards"

Conclusion: Random mask = structured noise regularization. Fixed mask = sabotage.

v4 - CIFAR-10 (Harder Task)

Date: 2025-12-29

Changes:
- Switch from MNIST to CIFAR-10 (32x32 RGB images, 10 classes)
- New CIFAR10CNN architecture (~582K params)
- Data augmentation (RandomCrop with padding, RandomHorizontalFlip)
- Quick focused test: 0% vs 5% rebels, 10 epochs

Results - Final Accuracies:

| Rebel % | Final Train | Final Test |
|---------|-------------|------------|
| 0%      | 68.87%      | 72.85%     |
| 5%      | 68.02%      | 72.55%     |

Results - Convergence Speed (epochs to reach threshold):

| Rebel % | 50% | 60% | 70% | 75% |
|---------|-----|-----|-----|-----|
| 0%      | 2   | 3   | 7   | N/A |
| 5%      | 2   | 3   | 8   | N/A |

Key observations:
1. Results nearly identical - only a 0.3% difference in final test accuracy.
2. 5% is slightly slower to 70% - epoch 8 vs epoch 7 (1 epoch slower).
3. Neither hit 75% in 10 epochs - needs longer training.
4. Train accuracy gap - 0% trains slightly better (68.87% vs 68.02%).
5. Test loss is similar - both end around 0.77-0.81.

Interpretation:
- On this harder task, 5% rebels show neither clear benefit nor harm
- The regularization effect seen on MNIST may be masked by other factors (dropout, data augmentation)
- Need longer training to see if rebels help in later epochs (escaping local minima?)

Next steps:
- Run a longer test (30-50 epochs) to see if the curves diverge
- Try a wider range of rebel % on CIFAR-10
- Compare with/without data augmentation (rebels might help more without aug)

TL;DR

As the automated report above shows, the idea worked(ish), but I suspect the benefit is simply a form of noise that helps generalization, and the "rebel" aspect is just an over-complicated way of injecting it.