01 Pruning weights
Optimal Brain Damage (OBD)⚓︎
- Yann LeCun, John Denker, Sara Solla, Optimal Brain Damage (NIPS 1989)
- Optimal pruning of neural networks (article in Russian)
The oldest pruning work. How do we define the saliency of a weight, besides simply using its magnitude? OBD uses the change in the objective function caused by deleting (or perturbing) the parameter; the resulting saliency formula is shown below.
Drawbacks:
- Computationally prohibitive, as second-derivative computations are expensive.
- Cross terms in the Hessian matrix are ignored.
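For reference, OBD derives the saliency from a second-order Taylor expansion of the objective around the trained solution, keeping only the diagonal Hessian terms:

\[
\delta E \approx \frac{1}{2}\sum_k h_{kk}\,\delta w_k^2,
\qquad
s_k = \frac{h_{kk}\,w_k^2}{2},
\qquad
h_{kk} = \frac{\partial^2 E}{\partial w_k^2},
\]

so the weights with the smallest saliency \(s_k\) are deleted first.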
Optimal Brain Surgeon⚓︎
- Uses information from all second-order derivatives of the error function (the full Hessian, not just its diagonal) to perform network pruning.
- Unlike other methods (such as OBD or magnitude pruning), OBS does not require (typically slow) retraining after a weight is pruned, because it also computes a compensating update for the remaining weights (shown below).
- Drawback: computationally prohibitive, as second-derivative computations are expensive.
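For reference, the standard OBS formulation: to delete weight \(w_q\), its saliency and the compensating update to the remaining weights are

\[
L_q = \frac{w_q^2}{2\,[\mathbf{H}^{-1}]_{qq}},
\qquad
\delta \mathbf{w} = -\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\,\mathbf{H}^{-1}\mathbf{e}_q,
\]

where \(\mathbf{e}_q\) is the unit vector selecting weight \(q\). The update \(\delta \mathbf{w}\) is what removes the need for retraining after each deletion.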
Deep Compression⚓︎
A more computationally feasible method: prune connections and retrain the remaining weights, deciding what to prune based solely on the magnitudes of the original weights.
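A minimal sketch of the magnitude-pruning step (PyTorch-style; function and variable names are illustrative, not from the Deep Compression code):

```python
import torch

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the (1 - sparsity) fraction of largest-magnitude weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return (weight.abs() > threshold).float()

# Prune, then retrain the surviving weights with the mask held fixed:
# mask = magnitude_prune_mask(layer.weight.data, sparsity=0.9)   # hypothetical layer
# layer.weight.data.mul_(mask)
# ... during retraining, re-apply the mask after every optimizer step
#     so pruned connections stay at zero.
```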
Pruning encoder-decoder models⚓︎
Pruning Schemes⚓︎
- How do we distribute the pruning over the different weight classes of our model? (A sketch comparing the schemes follows this list.)
- Class-blind: Take all parameters, sort them by magnitude and prune the x% with smallest magnitude, regardless of weight class.
- Class-uniform: Within each class, sort the weights by magnitude and prune the x% with smallest magnitude.
- Class-distribution: For each class \(c\), weights with magnitude less than \(\lambda\sigma_c\) are pruned, where \(\sigma_c\) is the standard deviation of the weights in that class.
- Retraining the sparse pruned network helps. Retrain with a smaller learning rate, a simpler LR schedule (halve the LR every 2 epochs), and fewer epochs.
- Class-blind pruning outperforms the other two schemes.
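A rough comparison of the class-blind and class-uniform schemes over a dictionary of weight classes (illustrative NumPy sketch, not the paper's implementation):

```python
import numpy as np

def prune_class_blind(classes: dict, fraction: float) -> dict:
    """One global magnitude threshold across all weight classes."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in classes.values()])
    threshold = np.quantile(all_mags, fraction)
    return {name: np.where(np.abs(w) < threshold, 0.0, w) for name, w in classes.items()}

def prune_class_uniform(classes: dict, fraction: float) -> dict:
    """A separate magnitude threshold within each weight class."""
    pruned = {}
    for name, w in classes.items():
        threshold = np.quantile(np.abs(w), fraction)
        pruned[name] = np.where(np.abs(w) < threshold, 0.0, w)
    return pruned
```

Class-distribution pruning would replace the per-class quantile with a threshold of \(\lambda\sigma_c\).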
Iterative Pruning⚓︎
- Regularization (L1/L2) while training.
- A fixed threshold is used for magnitude pruning in every iteration.
- Dropout Ratio Adjustment
- During retraining, the dropout ratio must be adjusted to account for the change in model capacity.
- As pruning has already reduced model capacity, the retraining dropout ratio should be smaller (see the formula below).
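As I recall the formulation in Han et al.'s pruning paper, the retraining dropout ratio is scaled by the square root of the fraction of surviving connections:

\[
D_r = D_o \sqrt{\frac{C_{ir}}{C_{io}}},
\]

where \(D_o\) and \(D_r\) are the dropout ratios during original training and retraining, and \(C_{io}\), \(C_{ir}\) are the connection counts of layer \(i\) before and after pruning.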
Iterative Magnitude Pruning for Transformers⚓︎
- Robin Cheong, Robel Daniel, Compressing Transformers with Pruning and Quantization (2019), ERNIE GitHub
- For a starting proportion X% and an ending proportion Y%, the iterative magnitude pruning procedure prunes X% of each of the pre-trained Transformer layers, begins retraining, and then prunes an additional (Y - X)/9% of each layer every 1001 iterations (sketched below).
- By the 10,000th iteration, the model has been iteratively pruned to Y%.
- Word embeddings are not factored into the compression rate.
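A sketch of that schedule in plain Python (names are illustrative, not from the authors' repository):

```python
def cumulative_pruned_pct(iteration: int, start_pct: float, end_pct: float,
                          step: int = 1001, num_increments: int = 9) -> float:
    """Cumulative pruning percentage reached by a given retraining iteration.

    Prunes start_pct up front, then an extra (end_pct - start_pct) / num_increments
    every `step` iterations until end_pct is reached.
    """
    increments = min(iteration // step, num_increments)
    return start_pct + increments * (end_pct - start_pct) / num_increments

# Example: starting at 10% and ending at 55%,
# cumulative_pruned_pct(0, 10, 55) == 10.0 and cumulative_pruned_pct(10_000, 10, 55) == 55.0
```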
Sparse BERT with improved pruning⚓︎
- Two problems with regularization-based pruning:
- A larger weight \(w_i\) is penalized more heavily than a smaller weight \(w_j\) in \(l_1\) regularization, which violates the original intention of weight pruning: removing the unimportant connections.
- Direct optimization of a regularization penalty term causes divergence from the original loss function and has a negative effect on the effectiveness of gradient-based updates.
- Solution: reweighted proximal pruning (RPP), which relies on proximal operators (see the sketch after this list).
- Decouples the goal of achieving high sparsity from that of minimizing the loss.
- NIP: Progressive/gradual pruning without regularizers.
- With RPP, next-sentence-prediction accuracy stays above 95% even when 90% of the weights are pruned. For BERT BASE, 80% of weights can be pruned for most GLUE tasks and 41% for SQuAD 1.1 with zero degradation.
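A toy illustration of the two fixes: the sparsity penalty is handled by a proximal (soft-thresholding) step outside the gradient update, and the per-weight penalty is reweighted so that larger weights are penalized less. This is a simplified sketch in the spirit of RPP, not the paper's algorithm:

```python
import numpy as np

def reweighted_proximal_step(w: np.ndarray, grad: np.ndarray,
                             lr: float = 1e-3, lam: float = 1e-4,
                             eps: float = 1e-6) -> np.ndarray:
    """One gradient step on the task loss, then a reweighted soft-thresholding (prox) step."""
    w = w - lr * grad                        # task loss only: no penalty term in the gradient
    alpha = lam / (np.abs(w) + eps)          # reweighting: large weights get a smaller penalty
    return np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)  # prox of the weighted l1 norm
```

The task loss is optimized with plain gradients, while sparsity is imposed separately by the prox step, which is the decoupling mentioned above.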
Should we prune large networks or build small dense networks?⚓︎
- Pruning involves extra processing, and sparse matrices need special handling; can we avoid it?
- Large-sparse models consistently outperform small-dense models and achieve up to a 10x reduction in the number of non-zero parameters with minimal loss in accuracy; they are obtained with a gradual sparsity schedule (see below).
- Models: stacked LSTMs for language modeling and seq2seq models for NMT.
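For reference, the gradual sparsity schedule used in this work (to the best of my recollection) ramps sparsity from an initial value \(s_i\) to a final value \(s_f\) over \(n\) pruning steps of span \(\Delta t\), starting at step \(t_0\):

\[
s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\,\Delta t}\right)^{3},
\qquad t \in \{t_0,\ t_0 + \Delta t,\ \ldots,\ t_0 + n\Delta t\}.
\]

Pruning is aggressive early, when redundant weights are plentiful, and tapers off as the target sparsity is approached.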