01 Pruning weights
Optimal Brain Damage (OBD)⚓︎
- Yann LeCun, John Denker, Sara Solla, Optimal Brain Damage (NIPS 1989)
- Optimal pruning of neural networks (article in Russian)
The oldest pruning work. How do we define the saliency of a weight, besides simply using its magnitude? OBD uses the change in the objective function caused by deleting (or perturbing) the parameter; the resulting saliency formula is shown below.
Drawbacks:
- Computationally prohibitive, as second-derivative computations are expensive.
- Cross terms in the Hessian matrix are ignored.
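For reference, OBD derives the saliency from a second-order Taylor expansion of the objective around the trained solution, keeping only the diagonal Hessian terms:

\[
\delta E \approx \frac{1}{2}\sum_k h_{kk}\,\delta w_k^2,
\qquad
s_k = \frac{h_{kk}\,w_k^2}{2},
\qquad
h_{kk} = \frac{\partial^2 E}{\partial w_k^2},
\]

so the weights with the smallest saliency \(s_k\) are deleted first.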
Optimal Brain Surgeon⚓︎
- Uses information from all second-order derivatives of the error function (the full Hessian, not just its diagonal) to perform network pruning.
- Unlike other methods (such as OBD or magnitude pruning), OBS does not require (typically slow) retraining after a weight is pruned, because it also computes a compensating update for the remaining weights (shown below).
- Drawback: computationally prohibitive, as second-derivative computations are expensive.
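For reference, the standard OBS formulation: to delete weight \(w_q\), its saliency and the compensating update to the remaining weights are

\[
L_q = \frac{w_q^2}{2\,[\mathbf{H}^{-1}]_{qq}},
\qquad
\delta \mathbf{w} = -\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\,\mathbf{H}^{-1}\mathbf{e}_q,
\]

where \(\mathbf{e}_q\) is the unit vector selecting weight \(q\). The update \(\delta \mathbf{w}\) is what removes the need for retraining after each deletion.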
Deep Compression⚓︎
A more computationally feasible method: prune connections and retrain the remaining weights, deciding what to prune based solely on the magnitudes of the original weights.
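A minimal sketch of the magnitude-pruning step (PyTorch-style; function and variable names are illustrative, not from the Deep Compression code):

```python
import torch

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the (1 - sparsity) fraction of largest-magnitude weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return (weight.abs() > threshold).float()

# Prune, then retrain the surviving weights with the mask held fixed:
# mask = magnitude_prune_mask(layer.weight.data, sparsity=0.9)   # hypothetical layer
# layer.weight.data.mul_(mask)
# ... during retraining, re-apply the mask after every optimizer step
#     so pruned connections stay at zero.
```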
Pruning encoder-decoder models⚓︎
Pruning Schemes⚓︎
- How do we distribute the pruning over the different weight classes of our model? (A sketch comparing the schemes follows this list.)
- Class-blind: Take all parameters, sort them by magnitude and prune the x% with smallest magnitude, regardless of weight class.
- Class-uniform: Within each class, sort the weights by magnitude and prune the x% with smallest magnitude.
- Class-distribution: For each class \(c\), weights with magnitude less than \(\lambda\sigma_c\) are pruned, where \(\sigma_c\) is the standard deviation of the weights in that class.
- Retraining the sparse pruned network helps. Retrain with a smaller learning rate, a simpler LR schedule (halve the LR every 2 epochs), and fewer epochs.
- Class-blind pruning outperforms the other two schemes.
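A rough comparison of the class-blind and class-uniform schemes over a dictionary of weight classes (illustrative NumPy sketch, not the paper's implementation):

```python
import numpy as np

def prune_class_blind(classes: dict, fraction: float) -> dict:
    """One global magnitude threshold across all weight classes."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in classes.values()])
    threshold = np.quantile(all_mags, fraction)
    return {name: np.where(np.abs(w) < threshold, 0.0, w) for name, w in classes.items()}

def prune_class_uniform(classes: dict, fraction: float) -> dict:
    """A separate magnitude threshold within each weight class."""
    pruned = {}
    for name, w in classes.items():
        threshold = np.quantile(np.abs(w), fraction)
        pruned[name] = np.where(np.abs(w) < threshold, 0.0, w)
    return pruned
```

Class-distribution pruning would replace the per-class quantile with a threshold of \(\lambda\sigma_c\).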
Iterative Pruning⚓︎
- Regularization (L1/L2) while training.
- A fixed threshold is used for magnitude pruning in every iteration.
- Dropout Ratio Adjustment
- During retraining, the dropout ratio must be adjusted to account for the change in model capacity.
- As pruning has already reduced model capacity, the retraining dropout ratio should be smaller (see the formula below).
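As I recall the formulation in Han et al.'s pruning paper, the retraining dropout ratio is scaled by the square root of the fraction of surviving connections:

\[
D_r = D_o \sqrt{\frac{C_{ir}}{C_{io}}},
\]

where \(D_o\) and \(D_r\) are the dropout ratios during original training and retraining, and \(C_{io}\), \(C_{ir}\) are the connection counts of layer \(i\) before and after pruning.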
Iterative Magnitude Pruning for Transformers⚓︎
- Robin Cheong, Robel Daniel, Compressing Transformers with Pruning and Quantization (2019), ERNIE GitHub
- For a starting proportion X% and an ending proportion Y%, the iterative magnitude pruning procedure prunes X% of each of the pre-trained Transformer layers, begins retraining, and then prunes an additional (Y - X)/9% of each layer every 1001 iterations (sketched below).
- By the 10,000th iteration, the model has been iteratively pruned to Y%.
- Word embeddings are not factored into the compression rate.
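A sketch of that schedule in plain Python (names are illustrative, not from the authors' repository):

```python
def cumulative_pruned_pct(iteration: int, start_pct: float, end_pct: float,
                          step: int = 1001, num_increments: int = 9) -> float:
    """Cumulative pruning percentage reached by a given retraining iteration.

    Prunes start_pct up front, then an extra (end_pct - start_pct) / num_increments
    every `step` iterations until end_pct is reached.
    """
    increments = min(iteration // step, num_increments)
    return start_pct + increments * (end_pct - start_pct) / num_increments

# Example: starting at 10% and ending at 55%,
# cumulative_pruned_pct(0, 10, 55) == 10.0 and cumulative_pruned_pct(10_000, 10, 55) == 55.0
```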
Sparse BERT with improved pruning⚓︎
- Two problems with regularization-based pruning:
- A larger weight \(w_i\) is penalized more heavily than a smaller weight \(w_j\) in \(l_1\) regularization, which violates the original intention of weight pruning: removing the unimportant connections.
- Direct optimization of a regularization penalty term causes divergence from the original loss function and has a negative effect on the effectiveness of gradient-based updates.
- Solution: reweighted proximal pruning (RPP), which relies on proximal operators (see the sketch after this list).
- Decouples the goal of achieving high sparsity from that of minimizing the loss.
- NIP: Progressive/gradual pruning without regularizers.
- With RPP, next-sentence-prediction accuracy stays above 95% even when 90% of the weights are pruned. For BERT BASE, 80% of weights can be pruned for most GLUE tasks and 41% for SQuAD 1.1 with zero degradation.
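A toy illustration of the two fixes: the sparsity penalty is handled by a proximal (soft-thresholding) step outside the gradient update, and the per-weight penalty is reweighted so that larger weights are penalized less. This is a simplified sketch in the spirit of RPP, not the paper's algorithm:

```python
import numpy as np

def reweighted_proximal_step(w: np.ndarray, grad: np.ndarray,
                             lr: float = 1e-3, lam: float = 1e-4,
                             eps: float = 1e-6) -> np.ndarray:
    """One gradient step on the task loss, then a reweighted soft-thresholding (prox) step."""
    w = w - lr * grad                        # task loss only: no penalty term in the gradient
    alpha = lam / (np.abs(w) + eps)          # reweighting: large weights get a smaller penalty
    return np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)  # prox of the weighted l1 norm
```

The task loss is optimized with plain gradients, while sparsity is imposed separately by the prox step, which is the decoupling mentioned above.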
Should we prune large networks or build small dense networks?⚓︎
- Pruning involves extra processing, and sparse matrices need special handling; can we avoid it?
- Large-sparse models consistently outperform small-dense models and achieve up to a 10x reduction in the number of non-zero parameters with minimal loss in accuracy; they are obtained with a gradual sparsity schedule (see below).
- Models: stacked LSTMs for language modeling and seq2seq models for NMT.
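For reference, the gradual sparsity schedule used in this work (to the best of my recollection) ramps sparsity from an initial value \(s_i\) to a final value \(s_f\) over \(n\) pruning steps of span \(\Delta t\), starting at step \(t_0\):

\[
s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\,\Delta t}\right)^{3},
\qquad t \in \{t_0,\ t_0 + \Delta t,\ \ldots,\ t_0 + n\Delta t\}.
\]

Pruning is aggressive early, when redundant weights are plentiful, and tapers off as the target sparsity is approached.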