Implementation details

It fiddles with the position embeddings, which enables LLMs trained with a finite attention window to generalize to infinite sequence lengths without any fine-tuning.
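A minimal sketch of that idea, assuming a cache made of a few "attention sink" tokens plus a rolling window of recent tokens (the names `n_sink` and `n_recent` are mine, not the repo's): position ids are assigned by a token's slot in the cache rather than its absolute index in the stream, so they stay within what the model saw during pre-training.

```python
from collections import deque

class StreamingKVCache:
    """Toy illustration: attention sinks + rolling window, with cache-relative positions."""

    def __init__(self, n_sink=4, n_recent=8):
        self.n_sink = n_sink
        self.sink = []                          # first tokens, never evicted
        self.recent = deque(maxlen=n_recent)    # rolling window; oldest ("middle") tokens drop out

    def append(self, token):
        if len(self.sink) < self.n_sink:
            self.sink.append(token)
        else:
            self.recent.append(token)

    def tokens_and_positions(self):
        # Positions are re-assigned relative to the current cache layout,
        # so they stay bounded no matter how long the stream gets.
        tokens = self.sink + list(self.recent)
        return tokens, list(range(len(tokens)))

cache = StreamingKVCache(n_sink=4, n_recent=8)
for t in range(20):                             # stream 20 tokens through a 12-slot cache
    cache.append(t)

tokens, positions = cache.tokens_and_positions()
print(tokens)     # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
print(positions)  # [0, 1, 2, ..., 11] -- never exceeds n_sink + n_recent
```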

Perplexity (PPL) stays stable as the stream grows.

Regarding my previous post: I don't see performance gains from this implementation at shorter context sizes.

Edit: They've added an FAQ.

Is the context window of LLMs expanded?

No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training.
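To make that concrete, a rough back-of-the-envelope sketch; the 4096-token window and 4 sinks are illustrative numbers, not taken from the repo:

```python
# Hypothetical setup: model pre-trained with a 4096-token window, 4 attention
# sinks, after streaming a 100,000-token document through it.
window, n_sink, stream_len = 4096, 4, 100_000

# Original positions the model can still attend to: the sinks plus the tail.
visible = list(range(n_sink)) + list(range(stream_len - (window - n_sink), stream_len))

print(len(visible))               # 4096 -- still just the pre-training window
print(visible[:6], visible[-1])   # [0, 1, 2, 3, 95908, 95909] 99999 -- middle tokens are gone
```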

Can I input an extensive text, like a book, into StreamingLLM for summarization?

While you can input a lengthy text, the model will only recognize the latest tokens. Thus, if a book is an input, StreamingLLM might only summarize the concluding paragraphs, which might not be very insightful.