Knowledge Distillation⚓︎
Train a student model to mimic a pre-trained, larger teacher.
Decoupled Knowledge Distillation, CVPR-2022, pdf, git
Distillation methods vary on:

- different types of teacher model
- different types of loss function:
    - squared error between the logits of the models,
    - KL divergence between the predictive distributions, or
    - some other measure of agreement between the model predictions (two common choices are sketched below)
- different choices for what dataset the student model trains on:
    - a large unlabeled dataset,
    - a held-out data set, or
    - the original training set
- Mimic what?
    - Teacher's class probabilities
    - Teacher's feature representation
- Learn from whom?
    - Teacher, teacher assistant, other fellow students
- Adversarial learning
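As a concrete illustration of two of the loss choices above, here is a minimal PyTorch sketch; the function names and the temperature value are illustrative, not taken from any particular paper.

```python
import torch.nn.functional as F

def logit_mse_loss(student_logits, teacher_logits):
    """Squared error between the raw logits of the two models."""
    return F.mse_loss(student_logits, teacher_logits)

def soft_kl_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the temperature-softened predictive distributions."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # "batchmean" gives the KL divergence averaged over the batch
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```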
Distilling the Knowledge in a Neural Network⚓︎
Geoffrey Hinton, Oriol Vinyals, Jeff Dean, Distilling the Knowledge in a Neural Network, 2015
- The relative probabilities of incorrect answers tell us a lot about how the teacher model tends to generalize.
- Softmax with temperature: \(q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}\)
- Use cross entropy (\(H\)) for both the soft and the hard part of the loss function. Typically, a smaller weight on the hard part works better.
- When using both hard and soft targets, the magnitudes of the gradients produced by the soft targets scale as \(1/T^2\), so it is important to multiply the soft loss by \(T^2\).
\(L_{KD}(W_S) = H(y_{\text{true}}, P_S) + \lambda H(P_T, P_S)\)
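Putting the pieces together, a minimal PyTorch sketch of this combined loss could look like the following; \(T\) and \(\lambda\) are example values, and the KL term differs from the cross entropy \(H(P_T, P_S)\) only by the entropy of \(P_T\), which is constant with respect to the student.

```python
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, targets, T=2.0, lam=0.5):
    """L_KD = H(y_true, P_S) + lambda * H(P_T, P_S), with temperature T.

    The soft term is multiplied by T^2 so that its gradient magnitude
    stays comparable to the hard term (soft gradients scale as 1/T^2).
    """
    hard = F.cross_entropy(student_logits, targets)            # H(y_true, P_S)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                                # gradient-scale correction
    return hard + lam * soft
```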
Sobolev Training for Neural Networks⚓︎
Figure: a) Sobolev Training of order 2. Diamond nodes \(m\) and \(f\) indicate parameterised functions, where \(m\) is trained to approximate \(f\). Green nodes receive supervision. Solid lines indicate connections through which the error signals from the losses \(l\), \(l_1\), and \(l_2\) are backpropagated to train \(m\). b) Stochastic Sobolev Training of order 2. If \(f\) and \(m\) are multivariate functions, the gradients are Jacobian matrices. To avoid computing these high-dimensional objects, we can efficiently compute and fit their projections onto a random vector \(v_j\) sampled from the unit sphere.
- Sobolev Training for Neural Networks, NIPS 2017, pdf
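A minimal PyTorch sketch of the stochastic first-order variant from panel b), assuming `teacher` and `student` are differentiable modules; all names and the number of projections are illustrative.

```python
import torch
import torch.nn.functional as F

def stochastic_sobolev_loss(student, teacher, x, n_proj=1):
    """Match the teacher's outputs and the projections of the input
    Jacobians onto random unit vectors (avoids building full Jacobians)."""
    x = x.clone().requires_grad_(True)
    f_t = teacher(x)
    f_s = student(x)
    loss = F.mse_loss(f_s, f_t.detach())
    for _ in range(n_proj):
        v = torch.randn_like(f_t)
        v = v / v.norm(dim=-1, keepdim=True)           # random direction on the unit sphere
        # Vector-Jacobian products d<f, v>/dx for teacher and student
        g_t = torch.autograd.grad((f_t * v).sum(), x, retain_graph=True)[0]
        g_s = torch.autograd.grad((f_s * v).sum(), x,
                                  create_graph=True, retain_graph=True)[0]
        loss = loss + F.mse_loss(g_s, g_t.detach())    # match the projected gradients
    return loss
```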
Noisy teachers⚓︎
Let's include a noise-based regularizer while training the student from the teacher.

- The noise is Gaussian with mean 0 and standard deviation \(\sigma\), added to the teacher's logits.
- A noisy teacher is more helpful than a noisy student (see the sketch below).
Figure: Training shallow students using the proposed logit perturbation method.
Deep Model Compression: Distilling Knowledge from Noisy Teachers, 2016, pdf
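A minimal sketch of the logit-perturbation idea, assuming a simple logit-regression objective; the paper's exact loss, noise level, and fraction of perturbed samples may differ.

```python
import torch
import torch.nn.functional as F

def noisy_teacher_loss(student_logits, teacher_logits, sigma=0.5):
    """Perturb the teacher's logits with zero-mean Gaussian noise of
    standard deviation sigma, then regress the student's logits onto
    the noisy targets (sigma is an illustrative value)."""
    noisy_targets = teacher_logits + sigma * torch.randn_like(teacher_logits)
    return F.mse_loss(student_logits, noisy_targets.detach())
```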
Distillation architectures⚓︎
- Learning the student and teacher together
- Multiple teachers
- Adversarial methods
- https://huggingface.co/docs/transformers/model_doc/distilbert
- https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D
- Fine-Tuning DistilBERT for Multi-Label Text Classification (colab)
Multi-step knowledge distillation⚓︎
Also called chain distillation.

- Student network performance degrades when the gap between student and teacher is large.
- Multi-step knowledge distillation proposes an intermediate-sized network (teacher assistant, TA) to bridge the gap: the TA fills the gap between student and teacher (a sketch follows below).
- Improved Knowledge Distillation via Teacher Assistant, 2019, pdf
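A rough sketch of the two-stage chain, assuming `teacher`, `assistant`, and `student` are classifiers and `loader` yields `(x, y)` batches; the optimizer, temperature, and single-pass loop are placeholders, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def soft_targets(model, x, T):
    """Temperature-softened predictions from a frozen source model."""
    with torch.no_grad():
        return F.softmax(model(x) / T, dim=-1)

def chain_distillation(teacher, assistant, student, loader, T=4.0):
    """Distill teacher -> teacher assistant, then teacher assistant -> student."""
    def fit(target_model, source_model):
        opt = torch.optim.Adam(target_model.parameters())
        for x, _ in loader:                         # labels ignored in this sketch
            loss = F.kl_div(F.log_softmax(target_model(x) / T, dim=-1),
                            soft_targets(source_model, x, T),
                            reduction="batchmean") * (T * T)
            opt.zero_grad(); loss.backward(); opt.step()
    fit(assistant, teacher)       # stage 1: TA learns from the teacher
    fit(student, assistant)       # stage 2: student learns from the TA
```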
Distillation from many teachers⚓︎
Also called an ensemble of specialists.

- The teacher model could be an ensemble that contains:
    - one generalist model, trained on all the data
    - many specialist models, each trained on data that is highly enriched in examples from a very confusable subset of the classes
- The softmax of such a specialist can be made much smaller by combining all the classes it doesn't care about into a single \(pad\) class.
- Training the student:
    1. For each instance, find the \(n\) most probable classes according to the generalist model; call this set \(K\).
    2. Take all the specialist models \(m\) whose subset of confusable classes \(S^m\) has a non-empty intersection with \(K\), and call this the active set of specialists \(A_K\). Then find the full probability distribution \(q\) over all the classes that minimizes \(KL(p^{gen}, q) + \sum_{m \in A_K} KL(p^m, q)\) (a sketch of this step follows below).
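A sketch of step 2, assuming each specialist's distribution already includes a final catch-all (\(pad\)) entry for the classes it ignores; \(q\) is found by gradient descent on its logits, and the step count and learning rate are arbitrary.

```python
import torch
import torch.nn.functional as F

def combine_specialists(p_gen, specialists, steps=200, lr=0.1):
    """Find the distribution q over all classes that minimizes
    KL(p_gen, q) + sum_m KL(p_m, q) by gradient descent on q's logits.

    p_gen: tensor of shape (num_classes,), the generalist's distribution.
    specialists: list of (class_indices, p_m) pairs, where p_m covers the
        specialist's own classes plus one final catch-all ("pad") entry.
    """
    def kl(p, q):
        return torch.sum(p * (p.clamp_min(1e-12).log() - q.clamp_min(1e-12).log()))

    logits = torch.zeros_like(p_gen, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        q = F.softmax(logits, dim=-1)
        loss = kl(p_gen, q)
        for idx, p_m in specialists:
            # Project q onto the specialist's label space: its own classes
            # plus one catch-all bucket for everything else.
            q_own = q[idx]
            q_m = torch.cat([q_own, (1.0 - q_own.sum()).clamp_min(1e-12).unsqueeze(0)])
            loss = loss + kl(p_m, q_m)
        opt.zero_grad(); loss.backward(); opt.step()
    return F.softmax(logits, dim=-1).detach()
```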
KD with Adversarial methods⚓︎
Knowledge Distillation with Adversarial Samples Supporting Decision Boundary, 2018, pdf

Figure: The concept of knowledge distillation using samples close to the decision boundary. The dots represent training samples and the circle around a dot represents the distance to the nearest decision boundary. Samples close to the decision boundary enable more accurate knowledge transfer.

Figure: Iterative scheme to find boundary supporting samples for a base sample.
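A minimal sketch of such an iterative scheme, assuming the teacher is a differentiable classifier: starting from a base sample, take normalized gradient steps that shrink the teacher's logit margin between the base class and a chosen target class until the boundary is crossed. The step size, iteration count, and stopping rule are illustrative, not the paper's exact procedure.

```python
import torch

def boundary_supporting_sample(teacher, x, base_cls, target_cls,
                               step_size=0.02, max_iters=20):
    """Push a base sample toward the teacher's decision boundary between
    base_cls and target_cls by shrinking the logit margin between them.
    `x` is a batch whose samples all share the same base/target classes."""
    x_adv = x.clone().detach()
    for _ in range(max_iters):
        x_adv.requires_grad_(True)
        logits = teacher(x_adv)
        # The sample has crossed the boundary once this margin is negative.
        margin = logits[:, base_cls] - logits[:, target_cls]
        if (margin <= 0).all():
            break
        grad = torch.autograd.grad(margin.sum(), x_adv)[0]
        # Normalized gradient step in the direction that shrinks the margin.
        x_adv = (x_adv - step_size * grad / (grad.norm() + 1e-12)).detach()
    return x_adv.detach()
```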