Alpha Histogram
Distribution of per-layer α. Good: mass near α≈2–4. Bad: many layers with α<2 (overfitting/memorization) or α>6 (underfit).
Use this prompt to keep the WeightWatcher Help Assistant aligned with HTSR guidance and the in-app metrics whenever you refresh or reconfigure the model.
Interpretation guidance grounded in HTSR (Heavy-Tailed Self-Regularization) and WeightWatcher metrics.
WeightWatcher analyzes layer spectra to reveal training dynamics. In HTSR terms, well-fit layers typically have α≈2–4, few correlation traps, detX → 0, low KS distance, and high (1−random distance). We call the log spectral norm the layer's scale.
Smaller layer IDs are closer to the data; larger layer IDs are closer to the labels.
For model vs base α, the lower-right corner (often highlighted in green) is best.
Lower is better unless otherwise noted. For (1−RD), higher is better because we negate RD.
Correlation traps are unstable spectral spikes; their presence typically indicates overfitting.
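The α values in these charts come from fitting a power law to each layer's eigenvalue spectrum (the ESD). WeightWatcher does this with a maximum-likelihood power-law fit; as a rough, self-contained illustration of the idea, here is a Hill-estimator sketch. The tail-size heuristic `k` is an ad-hoc choice for this sketch, not a WeightWatcher default.

```python
import numpy as np

def hill_alpha(eigs, k=None):
    """Rough power-law exponent of the ESD tail via the Hill estimator:
    alpha = 1 + k / sum(log(x_i / x_min)) over the top-k eigenvalues.
    Illustrative only -- WeightWatcher uses an MLE power-law fit."""
    eigs = np.sort(np.asarray(eigs, dtype=float))
    if k is None:
        k = max(10, len(eigs) // 5)  # tail size: an ad-hoc heuristic
    tail = eigs[-k:]
    xmin = eigs[-k - 1]  # the (k+1)-th largest value anchors the tail
    return 1.0 + k / np.sum(np.log(tail / xmin))
```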
α≈2–4: well-fit; α<2: overfitting/memorization; α>6: underfit. detX → 0 at critical balance.
Scatter of α vs KS distance. Good: low KS at α≈2–4. Spikes in KS flag layers deviating from random-matrix predictions.
Lower is better; broad/high tails suggest spectral instability or overfitting.
Layer-wise KS. Flat/low is stable; spikes mark suspect layers (remember: small IDs ≈ data, large IDs ≈ labels).
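The KS distance here is the Kolmogorov–Smirnov statistic WeightWatcher reports for each layer (the maximum gap between the ESD and its fitted power law). A minimal two-sample sketch of the statistic itself, assuming only the standard empirical-CDF definition:

```python
import numpy as np

def ks_distance(sample, reference):
    """Kolmogorov-Smirnov distance: the maximum gap between two empirical
    CDFs. WeightWatcher's D compares a layer's ESD with its fitted power
    law; this sketch just shows the statistic on two samples."""
    xs = np.sort(np.concatenate([sample, reference]))
    cdf_a = np.searchsorted(np.sort(sample), xs, side="right") / len(sample)
    cdf_b = np.searchsorted(np.sort(reference), xs, side="right") / len(reference)
    return np.max(np.abs(cdf_a - cdf_b))
```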
We plot 1−RD, so higher is better. Well-trained layers cluster near α≈2–4 with high (1−RD).
Histogram of 1−RD. High values → HTSR-consistent; low values → spectral deviation/instability.
High/flat flows are better; dips pinpoint problematic layers along depth.
In HTSR, detX → 0 indicates critical balance/self-regularization. Large |detX| suggests deviation from criticality or unstable dynamics.
Most layers should concentrate near 0. Outliers away from 0 indicate instability or collapse.
Lines near 0 are best (critical balance). Oscillations/large deviations highlight mis-trained or unstable regions.
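One way to see why a near-zero detX reads as "critical balance": if detX denotes the log-determinant of the trace-normalized correlation matrix X = WᵀW (eigenvalues rescaled to average 1), then log|det X| ≤ 0, with equality only for a perfectly flat spectrum. This is a plausible reading for illustration; the exact quantity WeightWatcher computes may differ.

```python
import numpy as np

def log_detX(W):
    """log|det X| for X = W.T @ W rescaled so its eigenvalues average 1.
    Under that normalization the value is <= 0, hitting 0 only when the
    spectrum is perfectly balanced -- one plausible reading of the
    'detX -> 0 at critical balance' rule; the exact WeightWatcher
    definition is an assumption here."""
    X = W.T @ W
    X = X / (np.trace(X) / X.shape[0])  # eigenvalues now average 1
    sign, logdet = np.linalg.slogdet(X)
    return logdet
```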
Trap counts vs α. Presence of traps indicates overfitting; fewer traps near α≈2–4 is better.
Lower is better; heavy tails imply overfitting and unstable correlations.
Low/flat flows are better; rises flag growing overfitting along depth.
Layer-wise scale (log spectral norm). Smooth trends suggest stability; sharp jumps flag layer-scale imbalance or blow-ups.
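The "scale" plotted here is the log spectral norm defined earlier. A minimal sketch of that quantity, plus a jump detector for the sharp changes the text warns about; the 1.0 log10-unit threshold is an illustrative choice, not a WeightWatcher default.

```python
import numpy as np

def layer_scale(W):
    """A layer's 'scale' as defined above: log10 of its spectral norm
    (largest singular value)."""
    return float(np.log10(np.linalg.norm(W, ord=2)))

def scale_jumps(weights, threshold=1.0):
    """Flag adjacent layers whose scale differs by more than `threshold`
    log10 units. The threshold is an illustrative choice for this sketch."""
    scales = [layer_scale(W) for W in weights]
    return [i for i in range(1, len(scales))
            if abs(scales[i] - scales[i - 1]) > threshold]
```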
Correlations/traps evolving across depth. Rising structure indicates growing instability; flatter/declining patterns suggest stable learning.
Compare α across models by layer. α≈2–4 marks well-fit; α<2 overfitting/memorization; α>6 underfit. Remember the layer order (data→labels).
Tracks each model’s scale (log spectral norm). Smooth/parallel flows are stable; divergence or jumps indicate mismatched layer scales.
Mean α per model (filtered to 2≤α≤6). Lower is better (more layers in the well-fit regime).
Mean effective rank across models. Lower is better (more compression/regularization), assuming accuracy is maintained.
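"Effective rank" can be computed several ways; a common one is the entropy-based definition of Roy & Vetterli, exp of the Shannon entropy of the normalized singular values. Whether this chart uses that exact definition or the stable rank is an assumption of this sketch.

```python
import numpy as np

def effective_rank(W):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular values. Whether WeightWatcher uses this exact
    definition (vs. e.g. stable rank) is an assumption of this sketch."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop zero singular values before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))
```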
Average trap counts per model. Lower is better; more traps → more overfitting risk.
How the fine-tuned model’s α/correlation patterns evolve versus the base. Divergences flag layers altered by training (data→labels across IDs).
Layer-wise scale (log spectral norm) compared to the base. Highlights where scales drift or remain aligned.
Per-layer α (model vs base). The lower-right corner (green) is best: the model's α is lower (better fit) while the base's α is higher (worse fit).
We display 1−RD, so larger values are better (closer to ideal random-matrix behavior) relative to the base.
Near-zero detX indicates critical balance/self-regularization. Large |detX| means deviation from criticality. Compare the model’s detX to the base across layers.
Trap counts per layer versus base. Fewer traps than the base is better; traps indicate overfitting.
Emphasizes large singular values. Lower weighted α than base is better (heavier-tailed yet well-fit, ideally within α≈2–4).
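The weighted α (often written α̂ or alpha-hat in the WeightWatcher literature) scales each layer's α by the log of its largest eigenvalue, which is how large singular values get emphasized. This sketch assumes that α·log10(λ_max) form:

```python
import numpy as np

def weighted_alpha(alpha, eigs):
    """Weighted alpha (alpha-hat): alpha * log10(lambda_max). Scaling by
    the largest eigenvalue is what 'emphasizes large singular values';
    the exact alpha * log10(lambda_max) form is assumed here."""
    return float(alpha * np.log10(np.max(eigs)))
```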
Effective/stable rank vs base. Lower is better if accuracy holds; large increases can indicate dispersion/instability.
Average α over layers. Lower is better (more layers in the well-fit band). Filtered variants focus on 2≤α≤6.
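A minimal sketch of the filtered mean described above, assuming layers outside the 2≤α≤6 band are excluded from the average rather than clipped to it:

```python
def filtered_mean_alpha(alphas, lo=2.0, hi=6.0):
    """Average alpha over layers, keeping only the 2 <= alpha <= 6 band.
    Excluding (rather than clipping) out-of-band layers is an assumption
    of this sketch."""
    kept = [a for a in alphas if lo <= a <= hi]
    return sum(kept) / len(kept) if kept else float("nan")
```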
Lower is better (more compression/regularization without instability).
Bars show 1−RD. Higher is better (closer to ideal random-matrix behavior).
Lower is better (fewer overfitting spikes).
For hands-on training, fine-tuning, or deeper model diagnostics, reach out to Charles Martin via LinkedIn.
The weightwatcher tool has been developed by Calculation Consulting. We provide consulting to companies looking to implement Data Science, Machine Learning, and/or AI solutions. Reach out today to learn how to get started with your own AI project. Email: Info@CalculationConsulting.com Please review our Terms of Service and Privacy Policy.