Attention is how information is shared between tokens in a sequence. A causal mask on the attention scores ensures that the model only uses information from the past to predict the future.
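As a minimal sketch of what that means in code (single head, no batch dimension, tensor shapes chosen for clarity rather than taken from any particular model):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (seq_len, d) -- one head, no batch, for readability
    seq_len, d = q.shape
    scores = q @ k.T / d**0.5                    # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions j <= i
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ v                           # (seq_len, d)
```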

We see that longer contexts do help even beyond the sliding window, but once the sequence length grows too large, the model stops making use of the full context.
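With a sliding window, each token attends only to the previous W tokens at a given layer; because layers stack, information can still be relayed further back than W, which is one reading of why context beyond the window helps at all. A small sketch of the mask itself (the window size here is an arbitrary example):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True marks positions to mask out. Token i attends to tokens
    # j in [i - window + 1, i]: causal, but capped at `window` tokens back.
    i = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len)
    return (j > i) | (j <= i - window)

print(sliding_window_causal_mask(5, window=2).int())
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [1, 0, 0, 1, 1],
#         [1, 1, 0, 0, 1],
#         [1, 1, 1, 0, 0]])
```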

When generating a sequence, we have to predict tokens one by one. The prompt, however, is known in advance, so we can pre-fill the (k, v) cache with it in a single parallel pass.
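A rough sketch of that split, using a toy single-head cache (hypothetical names; real models keep one cache per layer and head, and the random tensors below stand in for projected keys and values):

```python
import torch

class KVCache:
    def __init__(self):
        self.k = None  # (cached_len, d)
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=0)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=0)

cache = KVCache()

# Prefill: the whole prompt is known, so its keys/values are computed
# in one parallel pass over all prompt positions.
prompt_k, prompt_v = torch.randn(10, 64), torch.randn(10, 64)
cache.append(prompt_k, prompt_v)

# Decode: each new token computes only its own (k, v) and attends
# against everything already cached -- the prompt is never recomputed.
new_k, new_v = torch.randn(1, 64), torch.randn(1, 64)
cache.append(new_k, new_v)
print(cache.k.shape)  # torch.Size([11, 64])
```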

Reminder: how attention works
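The usual refresher is the scaled dot-product formula:

```latex
\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
```

where each row of the softmax gives one token's weights over the tokens it is allowed to attend to.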

  • noneabove1182@sh.itjust.works · 1 year ago
    This is great and comes with a very interesting model!

    I wonder if they cleverly slide the window in any way or if it’s just a naive slide, could probably be pretty smart if you discard tokens that have minimal attention on them anyways to focus on important text

    For now, this is awesome!
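The eviction idea floated in the comment above — score cached tokens by how much attention they actually receive and drop the least-used ones — could look roughly like this. This is purely a sketch of the idea, not what any particular model ships; similar "heavy hitter" cache-eviction schemes do appear in the literature:

```python
import torch

def evict_low_attention(cache_k, cache_v, attn_weights, budget):
    # attn_weights: (num_queries, cached_len) softmax weights from recent steps.
    # Score each cached token by the total attention mass it received,
    # then keep only the `budget` highest-scoring entries (hypothetical policy).
    scores = attn_weights.sum(dim=0)             # (cached_len,)
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    return cache_k[keep], cache_v[keep]

k, v = torch.randn(8, 64), torch.randn(8, 64)
w = torch.rand(3, 8)          # attention from the last 3 queries
k2, v2 = evict_low_attention(k, v, w, budget=4)
print(k2.shape)               # torch.Size([4, 64])
```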