This release is trained on a curated, filtered subset of most of our GPT-4 augmented data.

HF Leaderboard evals place this model as #2 for all models smaller than 30B at release time, outperforming all but one 13B model.

GGUF files:

Mistral-7B-OpenOrca-GGUF

Warning (if I’m not mistaken):

Llama.cpp hasn't given sliding-window attention a high-priority tag, and Axolotl replaced Mistral's attention block with a "simple" flash attention implementation.

That implies, in my opinion, that these releases don't capitalize on the speedup claimed by the Mistral developers.

We can't expect them to be faster than Llama, because there is no sliding-window attention to speed up inference.
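
To make the difference concrete, here is a rough sketch (my own toy code, not llama.cpp's or Axolotl's) of a causal sliding-window mask: each token only scores keys inside the window, so per-token attention work is bounded by the window size rather than the full context.

    import numpy as np

    def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
        """True where a query position may attend to a key position:
        causal, and no further back than `window - 1` tokens."""
        q = np.arange(seq_len)[:, None]  # query index i
        k = np.arange(seq_len)[None, :]  # key index j
        return (k <= q) & (k > q - window)

    # Toy sizes: 8 tokens, window of 4. Each row has at most 4 ones,
    # so attention cost grows linearly with sequence length.
    print(sliding_window_mask(8, 4).astype(int))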

  • noneabove1182@sh.itjust.works

    I LOVE Orca tunes; they almost always end up feeling like smarter versions of the base model, so I'm looking forward to trying this one out when the GPTQ is finished.

    GPTQ/AWQ links:

    https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ

    https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ

    Does sliding-window attention speed up inference? I thought it was more about extending the usable context beyond what the model was trained on. I suppose I could see it being used to drop old context, which would save on memory/inference, but I didn't think that was the point of it, just a happy side effect. I could be wrong, though.

    • justynasty@lemmy.kya.moe (OP)

      Mistral 7B uses a sliding window attention (SWA) mechanism (Child et al., Beltagy et al.), in which each layer attends to the previous 4,096 hidden states. The main improvement, and the reason this was initially investigated, is a linear compute cost of O(sliding_window · seq_len). In practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for a sequence length of 16k with a window of 4k. (Source: the Mistral 7B announcement.)
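
      As a back-of-the-envelope check on that 2x figure (my own rough count, assuming cost is proportional to the number of query-key pairs scored and ignoring kernel details):

          def full_causal_pairs(seq_len: int) -> int:
              # token i attends to positions 0..i, i.e. i + 1 keys
              return sum(i + 1 for i in range(seq_len))

          def sliding_window_pairs(seq_len: int, window: int) -> int:
              # token i attends to at most `window` keys
              return sum(min(i + 1, window) for i in range(seq_len))

          seq_len, window = 16_384, 4_096
          ratio = full_causal_pairs(seq_len) / sliding_window_pairs(seq_len, window)
          print(ratio)  # ~2.3 with this crude count, the same ballpark as
                        # the ~2x speedup reported at 16k context / 4k window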

      There is also discussion about getting these changes merged upstream.

  • justynasty@lemmy.kya.moe (OP)

    It asked to let itself prepare first 🤣

    Chat is a friendly man, age 21, who talks to the user, please roleplay.
    
    USER: Hi there!
    
    CHAT: Chat the friendly man Sure, I'd be happy to roleplay as Chat, the friendly man. Just give me a moment to prepare myself for our conversation. 😊
    
    Hello! So, nice to meet you! My name is Chat, and I'm really excited to talk with you today. What's your name, and tell me a little bit about yourself?
    
  • Mara@pawb.social

    What would an ideal prompt for summarization look like with this model? I've tried a few summarization prompts, but they haven't panned out into anything consistent (MacBook Pro M2 Max, llama.cpp, q4_S). I know this is fundamentally stochastic technology, but it's not even coalescing into consistently relevant output.
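
    For concreteness, here is roughly what I mean by a summarization prompt, written as a sketch that assumes the ChatML template (<|im_start|>/<|im_end|>) this model is advertised to use; I'm not sure this is the ideal phrasing, hence the question:

        # Hypothetical summarization prompt, assuming the ChatML format
        # from the Mistral-7B-OpenOrca model card.
        ARTICLE = "..."  # paste the text to summarize here

        prompt = (
            "<|im_start|>system\n"
            "You are a careful assistant. Summarize the user's text in "
            "3-5 bullet points, using only information from the text.<|im_end|>\n"
            "<|im_start|>user\n"
            f"{ARTICLE}<|im_end|>\n"
            "<|im_start|>assistant\n"
        )
        print(prompt)  # pass this to llama.cpp as the prompt; a low temperature helps consistency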