Length extrapolation aims to ensure that a model continues to perform well even when the number of input tokens at inference time exceeds the size of the context window the model was trained on.

More recently, LongLoRA proposes shift short attention to approximate full attention. However, all of these methods require full-length fine-tuning, incurring a computational cost that grows with the target context size. By contrast, PoSE decouples the training length from the target length, requiring only the original context window size for fine-tuning.
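To make the decoupling concrete, below is a minimal Python sketch of how PoSE-style position indices might be constructed: the model still attends over only `train_len` tokens, but their RoPE position ids are spread across the full `target_len` range by inserting random skips between chunks. The function name, the equal-length chunking, and the uniform sampling of skips are illustrative assumptions; the paper's exact chunking and content-sampling strategy may differ.

```python
import random

def pose_position_ids(train_len=2048, target_len=16384, num_chunks=2):
    """Sketch of PoSE-style position indices: only `train_len` tokens are
    processed, but their position ids cover positions up to `target_len`."""
    # Split the training window into contiguous chunks (equal sizes here).
    base = train_len // num_chunks
    chunk_lens = [base] * (num_chunks - 1) + [train_len - base * (num_chunks - 1)]

    # Total "slack" that can be distributed as skips while keeping
    # every position id strictly below target_len.
    slack = target_len - train_len
    # Non-decreasing skip offsets, one per chunk (the first chunk may also shift).
    skips = sorted(random.randint(0, slack) for _ in range(num_chunks))

    position_ids, start = [], 0
    for skip, length in zip(skips, chunk_lens):
        position_ids.extend(range(start + skip, start + skip + length))
        start += length
    return position_ids

ids = pose_position_ids()
print(len(ids), max(ids))  # 2048 tokens, but positions range up to ~16k
```

Because the skips are resampled every step, the model is exposed to the whole extended position range over training while each batch stays at the original window length.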

Figure 2: (a) Prompt template used for passkey retrieval; (b) retrieval accuracy for the non-fine-tuned LLaMA model (None) and the PoSE-extended counterparts with 16k / 32k context windows. Both PoSE-extended models maintain high retrieval accuracy (≥ 90%) within their respective context windows.
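For reference, a sketch of how such a passkey-retrieval prompt can be built. The wording follows the commonly used template from Mohtashami & Jaggi (2023); the exact template in Figure 2(a), the function name, and the filler count are assumptions for illustration.

```python
import random

def build_passkey_prompt(n_filler=400, passkey=None):
    """Build a prompt that hides a numeric passkey inside repetitive filler text."""
    passkey = passkey if passkey is not None else random.randint(10000, 99999)
    head = ("There is an important info hidden inside a lot of irrelevant text. "
            "Find it and memorize it. I will quiz you about the important information there.")
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    question = "What is the pass key? The pass key is"

    # Hide the passkey at a random depth inside the filler text.
    insert_at = random.randint(0, n_filler)
    body = filler * insert_at + needle + filler * (n_filler - insert_at)
    return head + "\n" + body + "\n" + question, passkey
```

Retrieval is counted as correct when the generated continuation contains the passkey; sweeping `n_filler` varies the total prompt length, which is how accuracy curves like those in Figure 2(b) are produced.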

The PoSE-extended models exhibit only marginal performance degradation compared with both full-length fine-tuning and the original, non-extended model.
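Degradation of this kind is typically quantified as perplexity on long documents evaluated at the extended context length. A minimal sketch of such a measurement is below; the function name and the single-document, single-window setup are illustrative and may differ from the paper's exact evaluation protocol.

```python
import torch

@torch.no_grad()
def long_context_perplexity(model, tokenizer, text, max_len=16384):
    """Token-level perplexity of a Hugging Face causal LM over one long document,
    truncated to the extended context length."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len].to(model.device)
    loss = model(ids, labels=ids).loss   # mean negative log-likelihood per token
    return torch.exp(loss).item()
```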

LLaMA v1 links:

LLaMA 7B PoSE YaRN 16K

LLaMA 7B PoSE YaRN 128K
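A hedged sketch of how one of these checkpoints might be loaded and run with Hugging Face transformers. `MODEL_PATH` and `long_document.txt` are placeholders rather than the actual repository ids behind the links above, and the YaRN RoPE-scaling parameters are assumed to ship in the checkpoint's config.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: substitute the repository id behind the "LLaMA 7B PoSE YaRN 16K" link.
MODEL_PATH = "path/to/llama-7b-pose-yarn-16k"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,   # assumes a GPU with enough memory for fp16 weights
    device_map="auto",           # requires the `accelerate` package
)

# Feed a prompt longer than the original 2k training window.
prompt = "Summarize the following document:\n" + open("long_document.txt").read()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```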