Assumption: prompts available at decoding time are sufficient for contextual generation,
and a prompt given to the encoder will affect the decoding phase. A prompt may refer to a task, a sentiment,
or additional context, and the model should generalize across prompts automatically.
- A default prompt (prompt_0) for training would be
- For multilingual training, two additional prompt channels could be “enc-lang” and “dec-lang”. These prompts could be included in the training data with probability 0.5, since there are some indications (XLM, Conneau et al., 2019; Dufter et al., 2021) that adding language anchors may decrease multilingual performance. But my take here is that, at a large enough scale, this should not be an issue.
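The sampling scheme above can be sketched as follows. This is a minimal, hypothetical illustration: the function name `build_prompt_channels` and the channel keys are assumptions for the sake of the example, not an existing API.

```python
import random

# Hypothetical sketch: assemble the prompt channels for one training example.
# The "enc-lang"/"dec-lang" anchors are included with probability 0.5, following
# the idea that hard language anchoring may hurt multilingual performance.
def build_prompt_channels(task, src_lang, tgt_lang, p_lang=0.5, rng=None):
    rng = rng or random.Random()
    channels = {"prompt_0": task}  # default task prompt, always present
    if rng.random() < p_lang:      # language anchors sampled with p = 0.5
        channels["enc-lang"] = src_lang
        channels["dec-lang"] = tgt_lang
    return channels

# Example: with a seeded RNG the sampling is reproducible.
example = build_prompt_channels("translate", "en", "fr", rng=random.Random(1))
```

With seed 1 the first draw falls below 0.5, so the language anchors are included; other seeds leave them out, exposing the model to both anchored and unanchored examples.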
To condition generation on different kinds of metadata (i.e., prompts), we can add multiple prompt channels.
A clearer description of multi-channel prompt training:
- Keep multiple prompt channels for text streams.
- Do not perform self-attention between the prompts and the text stream in the encoder.
- Run the forward pass on the text stream and the prompts in parallel to obtain their representations.
- Perform cross-attention at the decoder over the full representation of the text stream and the prompts.
- The model can learn combinations of multiple prompt channels with linear memory complexity.
- Larger prompts are easy to integrate into the encoder.
- The encoder should not be conditioned on the prompt.
- This should be more interesting for lifelong learning through prompt conditioning at the decoder.
- Because we do not condition on the prompt in the encoder, adding a new task in a lifelong-learning scenario may preserve previous tasks better (though this is not the ideal case).
- We can design complex tasks through combinations of prompts, e.g., translation, summarization, and HTML data (structural information, topic context, etc.).
- [A question from Timo] Can’t we already combine prompts now? E.g., if we want summarization + translation, we could say “Please write a summary in French”. Or we could use two prompts consecutively, something like: “Please write a summary. [MASK]. Please translate the summary to French. [MASK]”
- My observation here is that “Please write a summary in French” is much harder to generalize across languages. Anchoring on several smaller prompts (think of each prompt as a dimension for decoding in a hyperspace) should be more efficient than a single string-based prompt; i.e., “Please write a summary in language X” may introduce a curse of dimensionality across many languages. With separate channels, an LM may learn prompt:translation, prompt:src_lang, and prompt:tgt_lang, and thereby translate. Now add a new prompt, prompt:summarize: the language model should translate the source text into the target language and summarize it in the decoding step. My intuition is that this should perform very well in zero-shot or few-shot scenarios. We may not have summarization data for a target language, yet the LM would still perform summarization; if that happens, that would be great.
- Decoder complexity increases for larger prompts. However, encoder complexity does not increase, because the prompts are in separate channels.
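The multi-channel scheme described above can be sketched in a few lines of NumPy. This is a toy single-head attention with hypothetical shapes and no learned projections, not an actual implementation: the text and each prompt channel are encoded independently (encoder cost O(T² + Σᵢ Pᵢ²) rather than O((T + Σᵢ Pᵢ)²)), and only the decoder cross-attends over the concatenated representations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Toy single-head attention; learned Q/K/V projections omitted for brevity.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def encode_multi_channel(text, prompt_channels):
    # Text and each prompt channel are encoded independently: no self-attention
    # crosses channel boundaries, so the encoder is not conditioned on the
    # prompts and its cost grows linearly in the number of channels.
    text_repr = self_attention(text)
    prompt_reprs = [self_attention(p) for p in prompt_channels]
    return np.concatenate([text_repr] + prompt_reprs, axis=0)

def cross_attention(dec_states, memory):
    # The decoder cross-attends jointly over text + prompt representations,
    # which is where the prompt combination is actually learned.
    scores = dec_states @ memory.T / np.sqrt(memory.shape[-1])
    return softmax(scores) @ memory

rng = np.random.default_rng(0)
text = rng.standard_normal((6, 8))        # 6 text tokens, model dim 8
prompts = [rng.standard_normal((2, 8)),   # e.g. a task prompt channel
           rng.standard_normal((1, 8))]   # e.g. a language prompt channel
memory = encode_multi_channel(text, prompts)   # shape (6 + 2 + 1, 8)
out = cross_attention(rng.standard_normal((3, 8)), memory)  # 3 decoder steps
```

Adding a new prompt channel only appends rows to `memory`, so only the decoder's cross-attention grows with prompt size, matching the complexity claim above.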