Rebel AdamW Optimizer
rebel-adamw
Hypothesis
Idea: Modify AdamW so that a random percentage of parameters move opposite to what the gradient suggests each step. Call these "rebel" parameters.
Why it might be interesting:
- Acts as structured exploration noise: some weights actively probe "wrong" directions
- Could help escape sharp local minima (similar intuition to sharpness-aware minimization)
- Might regularize by preventing all weights from jointly overfitting
- Like "dropout for optimization": a random subset rebels against the gradient
Expected effects:
- Low rebel % (1-5%): slight regularization, possibly better generalization
- Medium rebel % (10-20%): slower convergence but maybe better final loss
- High rebel % (30%+): training diverges or fails to converge
Key design questions:
1. Flip the full Adam update, or just the gradient component?
2. Keep one mask fixed for the whole run, or re-sample it each step?
3. Should rebels still update their momentum buffers normally (and only flip the sign when the update is applied)?
4. Anneal the rebel % over training?
v1 approach: keep the implementation simple: re-sample the rebel mask each step and flip the full update for rebels. Test on MNIST to see whether it trains at all.
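A minimal sketch of this v1 design in PyTorch, assuming the flip is done by snapshotting parameters around a standard `torch.optim.AdamW` step (so moment buffers update normally and the full applied update, including weight decay, is negated for rebels). This illustrates the idea; it is not necessarily the exact implementation behind the results below.

```python
import torch


class RebelAdamW(torch.optim.AdamW):
    """AdamW where a random fraction of entries receives the *negated* update."""

    def __init__(self, params, rebel_pct=0.05, **kwargs):
        super().__init__(params, **kwargs)
        self.rebel_pct = rebel_pct

    @torch.no_grad()
    def step(self, closure=None):
        # Snapshot parameters so the applied update can be recovered afterwards.
        prev = {p: p.detach().clone()
                for group in self.param_groups
                for p in group["params"] if p.grad is not None}

        loss = super().step(closure)  # normal AdamW step; moment buffers update as usual

        # Re-sample a fresh rebel mask every step and flip the full update for
        # masked entries: new = old - delta instead of old + delta.
        for p, old in prev.items():
            mask = torch.rand_like(p) < self.rebel_pct
            delta = p - old
            p.copy_(torch.where(mask, old - delta, p))
        return loss
```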
Log
v1 - Initial Implementation
Date: 2025-12-29
Changes:
- Implement RebelAdamW optimizer (extends AdamW, flips the update for a random % of parameters)
- Simple MNIST CNN test comparing rebel_pct = 0%, 5%, 10%, 20% (harness sketch below)
- Track train/test loss and accuracy curves
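A hedged sketch of how the comparison could be wired up, reusing the `RebelAdamW` sketch above. `SmallCNN`, the hyperparameters, and the epoch count are illustrative assumptions, not the exact settings behind the table that follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"


class SmallCNN(nn.Module):
    # Illustrative MNIST CNN; not necessarily the v1 architecture.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return self.fc2(F.relu(self.fc1(x.flatten(1))))


def evaluate(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x.to(device)).argmax(1).cpu() == y).sum().item()
            total += y.numel()
    return correct / total


tfm = transforms.ToTensor()
train_ds = datasets.MNIST("data", train=True, download=True, transform=tfm)
test_ds = datasets.MNIST("data", train=False, download=True, transform=tfm)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=512)

for rebel_pct in [0.0, 0.05, 0.10, 0.20]:
    model = SmallCNN().to(device)
    opt = RebelAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2, rebel_pct=rebel_pct)
    for _ in range(5):  # epoch count is an assumption
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(x.to(device)), y.to(device)).backward()
            opt.step()
    print(f"rebel_pct={rebel_pct:.0%}  train_acc={evaluate(model, train_loader):.4f}  "
          f"test_acc={evaluate(model, test_loader):.4f}")
```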
Results:
| Rebel % | Final Train Acc | Final Test Acc |
|---|---|---|
| 0% | 99.17% | 99.15% |
| 5% | 99.14% | 99.09% |
| 10% | 99.10% | 99.10% |
| 20% | 99.16% | 99.24% |
Key observations:
1. Training still works! Even with 20% of parameters actively fighting the gradient, the model converges normally.
2. 20% rebels got the best test accuracy, 0.09 percentage points above baseline (99.24% vs 99.15%).
3. Test loss is smoother with rebels; the baseline (0%) shows more variance in its test-loss curve.
4. No convergence slowdown: train-loss curves are nearly identical across all rebel %.
Interpretation:
- The random sign-flipping acts as a form of regularization.
- Similar effect to adding noise, but structured (direction reversal rather than magnitude perturbation).
- MNIST may be too easy; a harder task is needed to see whether rebels help escape local minima.
Next steps:
- Try higher rebel % (30%, 40%, 50%) to find the breaking point.
- Test on CIFAR-10 (harder task, might show a clearer signal).
- Try annealing the rebel % down over training.
- Compare to simply adding Gaussian noise to the updates (hedged baseline sketch below).
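For the Gaussian-noise comparison, one plausible baseline (an assumption, not code from this log) perturbs the applied AdamW update with zero-mean noise scaled to the per-entry update magnitude, using the same snapshot trick as the `RebelAdamW` sketch above:

```python
import torch


class NoisyAdamW(torch.optim.AdamW):
    """AdamW with Gaussian noise added to the applied update (hypothetical baseline)."""

    def __init__(self, params, noise_scale=0.05, **kwargs):
        super().__init__(params, **kwargs)
        self.noise_scale = noise_scale

    @torch.no_grad()
    def step(self, closure=None):
        prev = {p: p.detach().clone()
                for g in self.param_groups for p in g["params"] if p.grad is not None}
        loss = super().step(closure)
        for p, old in prev.items():
            delta = p - old  # the update AdamW just applied
            # Zero-mean noise with per-entry scale proportional to |update|.
            p.add_(torch.randn_like(p) * delta.abs() * self.noise_scale)
        return loss
```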
v2 - TinyCNN + Convergence Speed Analysis
Date: 2025-12-29
Changes:
- Much smaller TinyCNN (~75K params vs ~1.2M) to make the task harder.
- Added a 30% rebel setting to find the breaking point.
- Track epochs to reach accuracy thresholds (97%, 97.5%, 98%, 98.5%); helper sketch below.
- 20 epochs to observe long-term behavior.
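A minimal sketch of the threshold bookkeeping, assuming a list of per-epoch test accuracies has already been collected; `epochs_to_thresholds` is a hypothetical helper, not the exact code used here.

```python
def epochs_to_thresholds(test_acc_history, thresholds=(0.97, 0.975, 0.98, 0.985)):
    """For each threshold, return the 1-indexed epoch at which test accuracy
    first reaches it, or None if it never does."""
    return {t: next((i + 1 for i, acc in enumerate(test_acc_history) if acc >= t), None)
            for t in thresholds}

# e.g. epochs_to_thresholds([0.965, 0.972, 0.978, 0.981, 0.984, 0.986])
#      -> {0.97: 2, 0.975: 3, 0.98: 4, 0.985: 6}
```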
Results - Final Accuracies:
| Rebel % | Final Train Acc | Final Test Acc |
|---|---|---|
| 0% | 95.09% | 98.87% |
| 5% | 95.61% | 98.93% |
| 10% | 95.70% | 98.85% |
| 20% | 95.27% | 98.90% |
| 30% | 94.37% | 98.75% |
Results - Convergence Speed (epochs to reach test-accuracy threshold):
| Rebel % | 97% | 97.5% | 98% | 98.5% |
|---|---|---|---|---|
| 0% | 2 | 2 | 4 | 7 |
| 5% | 2 | 2 | 4 | 8 |
| 10% | 2 | 2 | 4 | 7 |
| 20% | 2 | 3 | 5 | 8 |
| 30% | 2 | 4 | 6 | 12 |
Key observations:
1. 30% is too much: it takes ~2x longer to reach 98.5% test accuracy (epoch 12 vs 7) and final accuracy is worse.
2. 5% is the sweet spot: best final test accuracy with the same convergence speed as baseline.
3. 0-10% behave similarly, with no convergence penalty.
4. 20% shows a slight slowdown: one epoch slower to reach 97.5% and 98%.
5. The train-accuracy gap widens: rebels hurt training more visibly with the smaller network.
Interpretation:
- With constrained capacity, rebels are more noticeable.
- 5% rebels: free regularization with no speed penalty.
- 30% rebels: too much noise; optimization struggles.
- The sweet spot appears to be 5-10%.
Next steps:
- Try annealing: start high (20%) and decay to 0% (schedule sketch below).
- Test per-layer rebel % (more rebels in later layers?).
- Compare to dropout (similar regularization mechanism?).
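For the annealing idea, a simple linear schedule would be enough; `rebel_schedule` and its defaults are assumptions for illustration.

```python
def rebel_schedule(step, total_steps, start=0.20, end=0.0):
    """Linearly decay the rebel fraction from `start` to `end` over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

# Each training step, before opt.step():
#   opt.rebel_pct = rebel_schedule(global_step, total_steps)
```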
v3 - Fixed Rebel Mask (same weights rebel forever)
Date: 2025-12-29
Changes:
- Instead of re-sampling the rebel mask each step, create it ONCE and keep it fixed (sketch below).
- The same weights are always rebels (permanent gradient ascent on those parameters!).
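A minimal sketch of the v3 variant, under the same snapshot-based assumptions as the v1 sketch: each parameter's mask is sampled lazily on the first step and then reused for the rest of training.

```python
import torch


class FixedRebelAdamW(torch.optim.AdamW):
    """AdamW where a FIXED random subset of entries always gets the negated update."""

    def __init__(self, params, rebel_pct=0.05, **kwargs):
        super().__init__(params, **kwargs)
        self.rebel_pct = rebel_pct
        self._masks = {}  # parameter -> boolean rebel mask, created once and reused

    @torch.no_grad()
    def step(self, closure=None):
        prev = {p: p.detach().clone()
                for g in self.param_groups for p in g["params"] if p.grad is not None}
        loss = super().step(closure)
        for p, old in prev.items():
            if p not in self._masks:
                self._masks[p] = torch.rand_like(p) < self.rebel_pct
            # The same entries are flipped every step: permanent gradient ascent there.
            p.copy_(torch.where(self._masks[p], 2 * old - p, p))
        return loss
```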
Results - Fixed Mask vs Random Mask:
| Rebel % | v2 Random Mask Test Acc | v3 Fixed Mask Test Acc | Difference |
|---|---|---|---|
| 0% | 98.87% | 98.87% | same |
| 5% | 98.93% | 98.21% | -0.7% |
| 10% | 98.85% | 96.25% | -2.6% |
| 20% | 98.90% | 91.37% | -7.5% |
| 30% | 98.75% | 32.75% | -66% COLLAPSED |
Key observations:
1. A fixed mask is catastrophic: 30% fixed rebels collapse training (test accuracy falls to ~33%, versus 10% for random guessing on 10 classes).
2. Loss goes UP: with a fixed mask, train loss increases over time (the gradient ascent wins).
3. Accuracy declines: at 10% fixed, test accuracy starts at 97.9% and ends at 96.2%, getting worse as training continues.
4. The randomness is essential: random re-sampling acts as noise/regularization; a fixed mask does permanent damage.
Interpretation:
- With a random mask, any given weight only occasionally rebels and can still learn overall.
- With a fixed mask, the same weights do gradient ASCENT forever, actively maximizing the loss.
- This strongly suggests RebelAdamW's benefit comes from stochastic exploration, not from "some weights going backwards" per se.
Conclusion: Random mask = structured noise regularization. Fixed mask = sabotage.
v4 - CIFAR-10 (Harder Task)
Date: 2025-12-29
Changes:
- Switch from MNIST to CIFAR-10 (32x32 RGB images, 10 classes).
- New CIFAR10CNN architecture (~582K params).
- Data augmentation: RandomCrop with padding and RandomHorizontalFlip (transforms sketch below).
- Quick focused test: 0% vs 5% rebels, 10 epochs.
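The augmentation described above, written as standard torchvision transforms. The padding amount and normalization statistics are assumptions (the commonly used CIFAR-10 values), not settings taken from this log.

```python
from torchvision import transforms

CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_tfm = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random 32x32 crop from a padded image
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
test_tfm = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
```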
Results - Final Accuracies:
| Rebel % | Final Train Acc | Final Test Acc |
|---|---|---|
| 0% | 68.87% | 72.85% |
| 5% | 68.02% | 72.55% |
Results - Convergence Speed (epochs to reach test-accuracy threshold):
| Rebel % | 50% | 60% | 70% | 75% |
|---|---|---|---|---|
| 0% | 2 | 3 | 7 | N/A |
| 5% | 2 | 3 | 8 | N/A |
Key observations:
1. Results are nearly identical: only a 0.3 percentage point difference in final test accuracy.
2. 5% is slightly slower to reach 70%: epoch 8 vs epoch 7 (one epoch slower).
3. Neither run hit 75% within 10 epochs; longer training is needed.
4. Train-accuracy gap: 0% trains slightly better (68.87% vs 68.02%).
5. Test loss is similar: both end around 0.77-0.81.
Interpretation:
- On this harder task, 5% rebels show neither clear benefit nor clear harm.
- The regularization effect seen on MNIST may be masked by other regularizers (dropout, data augmentation).
- Longer training is needed to see whether rebels help in later epochs (escaping local minima?).
Next steps:
- Run a longer test (30-50 epochs) to see if the curves diverge.
- Try a wider range of rebel % on CIFAR-10.
- Compare with and without data augmentation (rebels might help more without augmentation).
TL;DR
As the automated report above shows, the idea works(ish): training tolerates a modest rebel fraction and sometimes generalizes marginally better. But I assume the effect is simply a form of noise that helps generalization, and the "rebel" aspect is really just an over-complicated way of injecting that noise.