
Rebel AdamW Optimizer

[email protected]

rebel-adamw

Hypothesis

Idea: Modify AdamW so that a random percentage of parameters move opposite to what the gradient suggests each step. Call these "rebel" parameters.

Why it might be interesting:
- Acts as structured exploration noise - some weights actively probe "wrong" directions
- Could help escape sharp local minima (similar intuition to sharpness-aware minimization)
- Might regularize by preventing all weights from jointly overfitting
- Like "dropout for optimization" - a random subset rebels against the gradient

Expected effects:
- Low rebel % (1-5%): Slight regularization, possibly helps generalization
- Medium rebel % (10-20%): Slower convergence but maybe better final loss
- High rebel % (30%+): Training diverges or fails to converge

Key design questions:
1. Flip the full Adam update, or just the gradient component?
2. Fixed mask per step, or re-sample each step?
3. Should rebels still update momentum buffers normally (but flip sign at application)?
4. Anneal rebel % over training?

v1 approach: Simple implementation - re-sample the rebel mask each step, flip the full update for rebels. Test on MNIST to see if it trains at all.
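The v1 step can be sketched as a thin wrapper around `torch.optim.AdamW`: let AdamW compute its usual update, then reverse that update for a freshly sampled subset of entries. The class name and the snapshot-and-diff approach are my reconstruction, not the notes' actual code:

```python
import torch

class RebelAdamW(torch.optim.AdamW):
    """Sketch of the v1 idea (assumed implementation): snapshot parameters,
    run a normal AdamW step, then flip the full update for a random
    `rebel_pct` fraction of entries, re-sampled every step."""

    def __init__(self, params, rebel_pct=0.05, **kwargs):
        super().__init__(params, **kwargs)
        self.rebel_pct = rebel_pct

    @torch.no_grad()
    def step(self, closure=None):
        params = [p for g in self.param_groups for p in g["params"]]
        before = [p.detach().clone() for p in params]    # snapshot
        loss = super().step(closure)                     # normal AdamW step
        for p, old in zip(params, before):
            delta = p - old                              # the full AdamW update
            mask = torch.rand_like(p) < self.rebel_pct   # fresh mask each step
            p[mask] = old[mask] - delta[mask]            # rebels move the other way
        return loss
```

The snapshot-diff trick sidesteps reimplementing AdamW's internals: whatever update AdamW applied (including weight decay) gets reflected for the rebels.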


Log

v1 - Initial Implementation

Date: 2025-12-29

Changes:
- Implement RebelAdamW optimizer (extends AdamW, flips the update for a random % of params)
- Simple MNIST CNN test comparing rebel_pct = 0%, 5%, 10%, 20%
- Track train/test loss and accuracy curves

Results:

| Rebel % | Final Train Acc | Final Test Acc |
|---------|-----------------|----------------|
| 0%      | 99.17%          | 99.15%         |
| 5%      | 99.14%          | 99.09%         |
| 10%     | 99.10%          | 99.10%         |
| 20%     | 99.16%          | 99.24%         |

Key observations:
1. Training still works! Even with 20% of parameters actively fighting the gradient, the model converges normally.
2. 20% rebels got the best test accuracy - 0.09% higher than baseline (99.24% vs 99.15%).
3. Test loss is smoother with rebels - the baseline (0%) shows more variance in its test loss curve.
4. No convergence slowdown - train loss curves are nearly identical across all rebel %.

Interpretation:
- The random sign-flipping acts as a form of regularization
- Similar effect to adding noise, but structured (direction reversal vs magnitude perturbation)
- MNIST may be too easy - need a harder task to see if rebels help escape local minima

Next steps:
- Try higher rebel % (30%, 40%, 50%) to find the breaking point
- Test on CIFAR-10 (harder task, might show a clearer signal)
- Try annealing rebel % down over training
- Compare to just adding Gaussian noise to updates
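One way to set up that last comparison (a hypothetical baseline, not from the notes): perturb each AdamW update with zero-mean Gaussian noise scaled to the update's own magnitude, so the perturbation size is comparable to a sign flip. The class name and `noise_scale` parameter are assumptions.

```python
import torch

class NoisyAdamW(torch.optim.AdamW):
    """Hypothetical Gaussian-noise baseline: instead of flipping updates,
    add noise with per-entry std = noise_scale * |update|."""

    def __init__(self, params, noise_scale=0.1, **kwargs):
        super().__init__(params, **kwargs)
        self.noise_scale = noise_scale

    @torch.no_grad()
    def step(self, closure=None):
        params = [p for g in self.param_groups for p in g["params"]]
        before = [p.detach().clone() for p in params]
        loss = super().step(closure)                 # normal AdamW step
        for p, old in zip(params, before):
            delta = p - old                          # the applied update
            # noise magnitude tracks the update magnitude entry-wise
            p.add_(torch.randn_like(p) * delta.abs() * self.noise_scale)
        return loss
```

If rebels are "just noise", this baseline should match their regularization effect at a comparable noise scale.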

v2 - TinyCNN + Convergence Speed Analysis

Date: 2025-12-29

Changes:
- Much smaller TinyCNN (75K params vs ~1.2M) to make the task harder
- Added 30% rebels to find the breaking point
- Track epochs to reach accuracy thresholds (97%, 97.5%, 98%, 98.5%)
- 20 epochs to see long-term behavior
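The notes don't include the TinyCNN definition, only its parameter count; the sketch below is a guess at what such a network could look like, landing near the reported ~75K parameters.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Hypothetical reconstruction of the v2 TinyCNN for MNIST.
    The real architecture isn't in the notes; this one just matches
    the reported parameter budget (~75-80K params)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 48), nn.ReLU(),
            nn.Linear(48, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```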

Results - Final Accuracies:

| Rebel % | Final Train | Final Test |
|---------|-------------|------------|
| 0%      | 95.09%      | 98.87%     |
| 5%      | 95.61%      | 98.93%     |
| 10%     | 95.70%      | 98.85%     |
| 20%     | 95.27%      | 98.90%     |
| 30%     | 94.37%      | 98.75%     |

Results - Convergence Speed (epochs to reach threshold):

| Rebel % | 97% | 97.5% | 98% | 98.5% |
|---------|-----|-------|-----|-------|
| 0%      | 2   | 2     | 4   | 7     |
| 5%      | 2   | 2     | 4   | 8     |
| 10%     | 2   | 2     | 4   | 7     |
| 20%     | 2   | 3     | 5   | 8     |
| 30%     | 2   | 4     | 6   | 12    |

Key observations:
1. 30% is too much - takes ~2x longer to reach 98.5% (epoch 12 vs 7) and has worse final accuracy.
2. 5% is the sweet spot - best final test accuracy, same convergence speed as baseline.
3. 0-10% behave similarly - no convergence penalty.
4. 20% shows a slight slowdown - 1 epoch slower to 97.5% and 98%.
5. The train accuracy gap widens - rebels hurt training more visibly with a smaller network.

Interpretation:
- With constrained capacity, rebels are more noticeable
- 5% rebels: free regularization with no speed penalty
- 30% rebels: too much noise, optimization struggles
- The sweet spot appears to be 5-10%

Next steps:
- Try annealing: start high (20%), decay to 0%
- Test per-layer rebel % (more rebels in later layers?)
- Compare to dropout (similar regularization mechanism?)
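The annealing idea could be as simple as a linear schedule. The notes don't specify a decay shape, so the function below (name and linear form both assumptions) is just one obvious candidate:

```python
def rebel_pct_schedule(step, total_steps, start=0.20, end=0.0):
    """Hypothetical linear anneal for the 'start high (20%), decay to 0%'
    idea: interpolate from `start` to `end` over `total_steps`, then hold."""
    frac = min(step / max(total_steps, 1), 1.0)  # clamp past end of training
    return start + (end - start) * frac
```

The optimizer's `rebel_pct` would then be updated from this schedule before each step (e.g. `opt.rebel_pct = rebel_pct_schedule(t, T)`).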

v3 - Fixed Rebel Mask (same weights rebel forever)

Date: 2025-12-29

Changes:
- Instead of re-sampling the rebel mask each step, create it ONCE and keep it fixed
- The same weights are always rebels (permanent gradient ascent on those params!)
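A minimal sketch of the v3 change (again a reconstruction, not the notes' actual code): draw the mask once on the first step and reuse it on every step thereafter.

```python
import torch

class FixedMaskRebelAdamW(torch.optim.AdamW):
    """Sketch of the v3 variant: the rebel mask is drawn once and then
    frozen, so the same entries fight the gradient on every step."""

    def __init__(self, params, rebel_pct=0.1, **kwargs):
        super().__init__(params, **kwargs)
        self.rebel_pct = rebel_pct
        self._masks = None  # created lazily on the first step, never re-sampled

    @torch.no_grad()
    def step(self, closure=None):
        params = [p for g in self.param_groups for p in g["params"]]
        if self._masks is None:
            self._masks = [torch.rand_like(p) < self.rebel_pct for p in params]
        before = [p.detach().clone() for p in params]
        loss = super().step(closure)  # normal AdamW step
        for p, old, mask in zip(params, before, self._masks):
            p[mask] = 2 * old[mask] - p[mask]  # reflect: old - (new - old)
        return loss
```

On a simple convex problem this already shows the failure mode: the frozen rebels walk steadily uphill while everything else converges.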

Results - Fixed Mask vs Random Mask:

| Rebel % | v2 Random Test | v3 Fixed Test | Difference |
|---------|----------------|---------------|------------|
| 0%      | 98.87%         | 98.87%        | same       |
| 5%      | 98.93%         | 98.21%        | -0.7%      |
| 10%     | 98.85%         | 96.25%        | -2.6%      |
| 20%     | 98.90%         | 91.37%        | -7.5%      |
| 30%     | 98.75%         | 32.75%        | -66% (COLLAPSED) |

Key observations:
1. A fixed mask is catastrophic - 30% fixed rebels cause training collapse (test accuracy falls to 32.75%).
2. Loss goes UP - with a fixed mask, train loss increases over time (gradient ascent wins).
3. Accuracy declines - 10% fixed: starts at 97.9%, ends at 96.2% (getting worse!).
4. The randomness is essential - random re-sampling = noise/regularization; fixed mask = permanent damage.

Interpretation:
- With a random mask, any given weight only occasionally rebels and can still learn overall
- With a fixed mask, the same weights do gradient ASCENT forever, actively maximizing loss
- This shows RebelAdamW's benefit comes from stochastic exploration, not from "some weights going backwards"

Conclusion: Random mask = structured noise regularization. Fixed mask = sabotage.

v4 - CIFAR-10 (Harder Task)

Date: 2025-12-29

Changes:
- Switch from MNIST to CIFAR-10 (32x32 RGB images, 10 classes)
- New CIFAR10CNN architecture (~582K params)
- Data augmentation (RandomCrop with padding, RandomHorizontalFlip)
- Quick focused test: 0% vs 5% rebels, 10 epochs

Results - Final Accuracies:

| Rebel % | Final Train | Final Test |
|---------|-------------|------------|
| 0%      | 68.87%      | 72.85%     |
| 5%      | 68.02%      | 72.55%     |

Results - Convergence Speed (epochs to reach threshold):

| Rebel % | 50% | 60% | 70% | 75% |
|---------|-----|-----|-----|-----|
| 0%      | 2   | 3   | 7   | N/A |
| 5%      | 2   | 3   | 8   | N/A |

Key observations:
1. Results nearly identical - only a 0.3% difference in final test accuracy.
2. 5% is slightly slower to 70% - epoch 8 vs epoch 7 (1 epoch slower).
3. Neither hit 75% in 10 epochs - needs longer training.
4. Train accuracy gap - 0% trains slightly better (68.87% vs 68.02%).
5. Test loss is similar - both end around 0.77-0.81.

Interpretation:
- On this harder task, 5% rebels show neither clear benefit nor harm
- The regularization effect seen on MNIST may be masked by other factors (dropout, data augmentation)
- Need longer training to see if rebels help in later epochs (escaping local minima?)

Next steps:
- Run a longer test (30-50 epochs) to see if the curves diverge
- Try a wider range of rebel % on CIFAR-10
- Compare with/without data augmentation (rebels might help more without aug)

TL;DR

As the automated report above shows, the idea worked(ish), but I suspect the benefit is simply a form of noise that helps generalization, and the "rebel" aspect is just an over-complicated way of injecting it.