Discussion from Dory: how to optimize inference?

Some people raised good points about how to optimize these models for inference. Creating this thread to discuss that.


Just gathering the questions for easier reference:

[Patrick Lewis]
We’re interested in training a very large model (which is awesome).
When the experiment is finished, do we have a plan of how we might enable lots of people to use it/query it?
Most groups probably won't have the compute resources to run/use a super-large model by themselves.
Specifically, it would be awesome to think about pooling resources for hosting inference - after all, if we do a great job of building the model, everyone will want to use it (and we want to make this easy, fair, and not a profit-making opportunity…)

[Antoine SIMOULIN]
[Technical question on the practical use of the model] Extremely large language models might be difficult to use in practice, due to hardware requirements, memory consumption, slow inference… Besides, some users might want to host the model on private infrastructure for privacy reasons. In such cases, should we consider ways to facilitate the deployment of the models? (For instance, compression methods such as weight sharing, knowledge distillation, …)
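
To make the compression point above concrete, here is a minimal sketch of a knowledge-distillation loss in PyTorch; the temperature and weighting values are illustrative assumptions, not settings from any particular recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft KL term against a (frozen) teacher with the usual
    hard-label cross-entropy. `temperature` and `alpha` are illustrative."""
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student distributions,
    # rescaled by T^2 as in the standard distillation formulation.
    kl = F.kl_div(soft_student, soft_teacher, log_target=True,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```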


Hi,

I'm greatly interested in this topic. I have experience building custom inference engines for Transformer models, so I'll be happy to participate in any discussion or development related to inference acceleration.

Just to confirm, is this in the scope of this workshop?

Added a working group on this (how to make the model accessible).

Since we can already fit an 11B-parameter T5 model on a single GPU with approaches like DeepSpeed's ZeRO, I think that with similar CPU/HDD-offloading approaches we should be able to make the model usable, if slow, on a single GPU, as long as someone has enough memory to store the intermediate activations (with ZeRO-Infinity on a server with enough memory you can, for instance, run a trillion-parameter model on a single GPU).
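
As a rough illustration of the offloading idea, a DeepSpeed ZeRO-3 configuration with parameter offload might look like the sketch below; the key names follow DeepSpeed's config schema, but the values (and the NVMe path) are placeholders to adjust for a given setup and DeepSpeed version.

```python
# Sketch of a ZeRO-3 config with parameter offload to CPU (or NVMe for
# ZeRO-Infinity). Values are placeholders; check the DeepSpeed docs for
# your version before relying on them.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",                 # or "nvme" with ZeRO-Infinity
            # "nvme_path": "/local_nvme",    # only needed for NVMe offload
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}

# This dict would then be passed to deepspeed.initialize(...) (or, with
# Hugging Face Transformers, supplied via the Trainer's `deepspeed` argument).
```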

If latency is a requirement for some researchers wanting to work with the model, we may be able to set up an inference service, but this would require some funding.
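
To give a concrete picture of what such a service could look like, here is a minimal sketch of a text-generation endpoint using FastAPI and a Transformers pipeline; the checkpoint name and route are placeholders, and a real deployment would need batching, authentication, and rate limiting on top.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Placeholder checkpoint; a real service would load the large model
# (likely sharded across several GPUs) behind this same interface.
generator = pipeline("text-generation", model="gpt2")

class GenerationRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(req: GenerationRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": out[0]["generated_text"]}
```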