Experiment suggestion: distillation in training

Here’s a suggestion for an experimental direction. I’m mostly trying out the format here, so feel free to shoot this down if it’s out of scope for the workgroup. If anybody else is interested in pursuing this, I’d be happy to help code this up.

A/ Scientific question

Does distillation during training confer any of the following benefits?

a) Smaller models with the same performance
b) More flexible parallel training
c) Better calibrated uncertainties

B/ Hypothesis/intuition

a) Smaller models

It’s well known that, after training, a large language model can be distilled into a much smaller model while retaining much of its performance (cf. DistilBERT, DistilGPT2). The large parameter space is needed for SGD training to work well, but once training is done, the model can be compressed substantially.

What if we do some of this distillation during training? The simplest approach would be to train a 2n-layer model, distill it into n layers, and then train n new layers on top of the distilled model. Does this let us train a more powerful model with the same computational budget?
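To make (a) concrete, here is a minimal sketch of a single distill-then-expand step. It assumes a PyTorch decoder-only model whose transformer blocks live in a `model.layers` ModuleList; the function names (`distill_step`, `expand`) and the temperature value are placeholders, not a worked-out design.

```python
import torch
import torch.nn.functional as F


def distill_step(teacher, student, batch, temperature=2.0):
    """One distillation update: match the student's next-token
    distribution to the teacher's softened distribution."""
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss  # backprop through the student only


def expand(student, n_new_layers, make_layer):
    """Stack freshly initialised blocks on top of the distilled
    student, then continue ordinary language-model training."""
    for _ in range(n_new_layers):
        student.layers.append(make_layer())
    return student
```

The open question is then whether the 2n-layer model obtained this way, under the same total compute, matches or beats a 2n-layer model trained end-to-end.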

b) Parallel training

Codistillation ([1804.03235] Large scale distributed neural network training through online distillation) is a parallel training method in which workers exchange model predictions on training inputs rather than gradient updates. Combining this with (a), we may be able to run codistillation on a heterogeneous pool of workers, distilling large models into small ones on the fly.
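As a rough sketch of the objective each worker would optimise (following the idea in arXiv:1804.03235, with the transport of peer predictions left abstract): `peer_logits` would come from a possibly stale copy of another worker's model, and all names here are placeholders.

```python
import torch.nn.functional as F


def codistillation_loss(model, batch, targets, peer_logits, alpha=0.5):
    logits = model(batch)
    # Ordinary language-model loss on this worker's data shard.
    lm_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1)
    )
    # Distillation term pulling this worker towards its peers'
    # (detached) predictions on the same batch.
    distill = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(peer_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return lm_loss + alpha * distill
```

Because only predictions cross the wire, nothing forces the peer to have the same architecture or size, which is what would make the heterogeneous pool idea viable.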

c) Better calibrated uncertainties

Ensembles (such as MC-dropout ensembles) can often provide better-calibrated uncertainty estimates than single models. Distilling such an ensemble into a single model (either after training or on the fly) would then give us the best of both worlds. Combining this with (b): can we treat the current pool of workers as an ensemble that learns from itself?
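A minimal sketch of how the distillation target could be built from an MC-dropout ensemble (keeping dropout active at prediction time is what turns a single network into an implicit ensemble; the sample count and function names are placeholders):

```python
import torch
import torch.nn.functional as F


def mc_dropout_target(model, batch, n_samples=8):
    model.train()  # keep dropout active: each pass is a distinct ensemble member
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(batch), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0)  # averaged predictive distribution


def ensemble_distill_loss(student, batch, target_probs):
    log_probs = F.log_softmax(student(batch), dim=-1)
    return F.kl_div(log_probs, target_probs, reduction="batchmean")
```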

C/ Experimental setup

a) Train a relatively small autoregressive model end-to-end and compare it to the same model trained with a single distillation/expansion step in the middle, keeping the computation budget fixed.
a.1) Find the best tradeoff between the time spent on distillation and on ordinary training (see the budget sketch after this list).
b) TBD
c) TBD
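For (a.1), the knob to sweep is how the fixed budget is split between training the teacher, distilling, and training the expanded model. A toy helper for the bookkeeping, counting optimizer steps for simplicity (counting FLOPs would be fairer, since the half-size student is cheaper per step); the default fractions are arbitrary placeholders.

```python
def split_budget(total_steps, teacher_frac=0.45, distill_frac=0.10):
    """Split a fixed step budget into (teacher, distill, expanded) phases.

    teacher_frac + distill_frac must be < 1; the remainder goes to
    training the expanded model. Sweeping these fractions is the
    tradeoff search in (a.1).
    """
    assert teacher_frac + distill_frac < 1.0
    teacher_steps = int(teacher_frac * total_steps)
    distill_steps = int(distill_frac * total_steps)
    expanded_steps = total_steps - teacher_steps - distill_steps
    return teacher_steps, distill_steps, expanded_steps


# Example: a 100k-step budget with the (arbitrary) default split.
print(split_budget(100_000))  # (45000, 10000, 45000)
```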

D/ Results

These experiments should tell us whether distillation during training can:

  • Help us get more value out of a fixed budget of GPU hours.
  • Provide a flexible framework for coordinating large-scale parallel training.
  • Improve robustness and uncertainty calibration.