Mistral 7B OpenOrca released

justynasty@lemmy.kya.moe · 11 months ago

Mistral 7B uses a sliding window attention (SWA) mechanism (Child et al., Beltagy et al.), in which each layer attends to the previous 4,096 hidden states. The main improvement, and reason for which this was initially investigated, is a linear compute cost of O(sliding_window × seq_len). In practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for sequence length of 16k with a window of 4k.

Source: Mistral 7B news

For longer prompts.
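
To make the cost claim a bit more concrete, here is a minimal sketch in plain Python that builds a sliding-window causal mask and counts how many (query, key) pairs get scored. The function names are made up for illustration, and it assumes the window includes the current position; this is not Mistral's actual implementation, just a way to see the O(sliding_window × seq_len) scaling.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j] == True when query position i may attend to key j.

    Assumption: causal attention where the window counts the current
    position, so position i sees keys i - window + 1 .. i.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(seq_len, window=None):
    """Count the (query, key) pairs scored under causal attention.

    window=None means full causal attention (quadratic in seq_len).
    Otherwise each query scores at most `window` keys, so the total is
    bounded by window * seq_len (linear in seq_len).
    """
    if window is None:
        return seq_len * (seq_len + 1) // 2
    return sum(min(i + 1, window) for i in range(seq_len))
```

With a 16k sequence and a 4k window, `attended_pairs(16384)` gives 134,225,920 pairs and `attended_pairs(16384, 4096)` gives 58,722,304, so roughly 2.3x fewer pairs to score. That is only a pair count, not a measured speedup, so treat it as a rough sanity check against the quoted 2x figure rather than an explanation of it.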

Talk about merging changes

noneabove1182@sh.itjust.works · 11 months ago

Ah good point, definitely looking forward to it being implemented then

Open-Orca/Mistral-7B-OpenOrca · Hugging Face