• 14 Posts
  • 28 Comments
Joined 10 months ago
Cake day: September 5th, 2023

  • You can download 7B and 13B Q8 models for such a GPU. 30B Q2 models would probably run out of memory.

    This shows that larger models have lower perplexity (i.e. they are more coherent). You can run conversational models, but not ones that act as an infinite knowledge base.

    Most of the paid services that provide open-source models use 13B models (for $15 per month); you can run those for free on your card.

    Someone else needs to recommend a tool to run models locally.
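
    A rough weights-only estimate is params × bits / 8; the effective bits per weight below are my guesses for the usual quant formats, and the KV cache plus runtime overhead come on top and grow with model size and context length:

    # Weights-only memory estimate: params × bits / 8.
    # KV cache and runtime buffers are extra and grow with model depth/width
    # and with context length, so real usage is noticeably higher.
    def weights_gb(params_billions, bits_per_weight):
        return params_billions * bits_per_weight / 8

    for name, params, bits in [("7B Q8", 7, 8.5), ("13B Q8", 13, 8.5), ("30B Q2", 30, 2.6)]:
        print(f"{name}: ~{weights_gb(params, bits):.1f} GB for the weights alone")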



  • TL;DR: You’re right, PyTorch and transformers need more memory.

    I’ll respond to the CPU-inference part first, for the transformers library.

    Do you use it on CPU? If yes, do you use quantization? Which one? I’d like to try it myself, but I don’t want to try all the possibilities to find out which one works.

    In transformers, I don’t use quantization. :L If you’re used to Q4 speed, it will be slower than that. For a 7B it’s almost okay on CPU.

    And I seriously doubt it’s getting as fast now that I’m squeezing 15GB of data through the memory bus instead of like 4GB.

    Yeah, it seems like you use low-bit quant downloads. D: Then it’s not for you.

    But you were on the right track with that 15GB download: you downloaded the raw release, not the GPTQ or AWQ versions, which are what we use in transformers (for new releases). ^^

    With llama.cpp, or KoboldCpp in my case, it’s just git pull; make clean; make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 and you’re good to go.

    That’s why I prefer C/Rust code, it just works. It will always be faster than whatever HF releases, with or without quantization.

    Regarding CPU inference and quantization, the cpp projects are ahead…

    Right, C/Rust code is more optimized.

    If you’re only using HF Transformers for one of its well-supported use cases, without a complex project around it, and using the same hardware as everyone else does, it’s easy.

    PyTorch without an Nvidia card is less common >:D That’s how I started.

    Imho most GitHub sources release buggy code; they don’t set device('cpu') for CPU users. Avoiding dependency hell is a must. I prefer a commented single file, not a complex Python project that spits out “bitsandbytes” errors.
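
    For reference, a minimal single-file sketch that sets the device explicitly (it reuses the model from the examples below; the prompt and generation settings are arbitrary):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = torch.device("cpu")  # be explicit, so the script also runs without CUDA
    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B")
    model = AutoModelForCausalLM.from_pretrained("NousResearch/Nous-Capybara-7B").to(device)

    inputs = tokenizer("Here is the prompt.", return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(output[0], skip_special_tokens=True))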

    They can do nice things like reuse the KV cache and avoid recalculating the whole context…

    So does HF, in their cough code cough. It’s likely that the same code in C is also more readable.
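
    For comparison, this is roughly what that cache reuse looks like on the transformers side (a sketch, using the same model and prompt as the examples below):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B")
    model = AutoModelForCausalLM.from_pretrained("NousResearch/Nous-Capybara-7B")

    # First pass: process the whole context once and keep the KV cache.
    ctx = tokenizer("Here is the prompt.", return_tensors="pt")
    out = model(**ctx, use_cache=True)

    # Next step: feed only the newly chosen token; the cached keys/values stand in
    # for the already-processed context instead of recalculating all of it.
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    out = model(input_ids=next_id, past_key_values=out.past_key_values, use_cache=True)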

    The reason I mentioned transformers is that this one line takes care of new model releases, with all the bugfixes, just as the cpp projects do.

    from transformers import pipeline

    generator = pipeline('text-generation', model='NousResearch/Nous-Capybara-7B')
    generator('Here is the prompt.')
    

    We run out of context?! Fix that: scale the rotary embedding.

    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    # dynamic RoPE scaling stretches the usable context beyond the trained length
    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B")
    model = AutoModelForCausalLM.from_pretrained(
        "NousResearch/Nous-Capybara-7B",
        rope_scaling={"type": "dynamic", "factor": 2.0},
    )
    generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
    generator('Here is the prompt.')
    

    Does it eat all your RAM? It does. It just works™, with fine print.

    How do you train? Download another tool! With transformers (and TRL):

    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer

    # `dataset` is assumed to be a text dataset loaded elsewhere (e.g. with `datasets`)
    model = AutoModelForCausalLM.from_pretrained(
        "NousResearch/Nous-Capybara-7B",
        use_flash_attention_2=True,
    )
    trainer = SFTTrainer(
        model,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=512,  # context size is 512
    )

    trainer.train()
    

    Does it eat all your RAM? Yup, it goes beyond 64GB.

    from peft import LoraConfig

    # LoRA: train small low-rank adapter matrices instead of the full weights
    peft_config = LoraConfig(r=16, task_type="CAUSAL_LM")
    trainer = SFTTrainer(
        model,
        train_dataset=dataset,
        dataset_text_field="text",
        peft_config=peft_config,
    )
    

    Now it eats less RAM, thanks to LoRA.


  • I have less faith in llama.cpp; new papers pop up every day, and HF has more resources to keep up with everything.

    PyTorch works well on CPU (without CUDA); people would just need a tutorial on how to set up a venv (or something better), git clone the model, point AutoModelForCausalLM at the downloaded weights, and open a port on 5001. The code fits on the screen without having to scroll.
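
    Something like this sketch (the local path, the request format and the generation settings are illustrative assumptions, not an existing project):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_DIR = "./Nous-Capybara-7B"  # hypothetical folder with the git-cloned / downloaded weights
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)  # CPU by default, no CUDA needed

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            inputs = tokenizer(body["prompt"], return_tensors="pt")
            output = model.generate(**inputs, max_new_tokens=200)
            text = tokenizer.decode(output[0], skip_special_tokens=True)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps({"text": text}).encode())

    HTTPServer(("127.0.0.1", 5001), Handler).serve_forever()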

    Any UI can connect to such a program. Character cards (from chub or other sources) are injected into the prompt on the frontend side. I used to download GGML files because they are lighter, and my download speed is from the past century.











  • Self-attention struggles with long sequences, due to its quadratic dependency on the sequence length. One query would attend to all keys and values, leading to computational inefficiencies. Sparse attention alleviates this issue by restricting the query’s access to a subset of keys and values.

    I think this reduces training time (improving on the quadratic time complexity), but the attention still can’t be spread out unevenly across an infinitely long text - the question that future models need to answer.
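
    A toy sketch of that restriction (a sliding window is just one way to pick the subset; the shapes and window size here are arbitrary):

    import torch
    import torch.nn.functional as F

    def local_attention(q, k, v, window=4):
        # q, k, v: (seq_len, dim); each query only attends to keys within `window` positions
        scores = q @ k.T / q.shape[-1] ** 0.5
        idx = torch.arange(q.shape[0])
        far_away = (idx[None, :] - idx[:, None]).abs() > window
        scores = scores.masked_fill(far_away, float("-inf"))
        # A real sparse implementation skips the masked blocks entirely to get below
        # quadratic cost; this toy only shows which positions each query may use.
        return F.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(16, 32)
    print(local_attention(q, k, v).shape)  # torch.Size([16, 32])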


  • This is a fix for a problem that shouldn’t have been there in the first place. One of the many architectural oversights in the Llama model (and its predecessors).

    There isn’t much to see for an end-user. This change won’t mean much to those who use (koboldcpp’s) smart context, summarization or character cards.

    Smaller models (7B down to 350M) can handle long conversations better, and they no longer need the text truncated to avoid producing garbage output.

    I am still waiting for the breakthrough in large models.










  • I also have a de-googled smartphone, with a firewall installed (without a jailbreak). My name doesn’t show up on Google. I use generic usernames, not unique ones. I don’t upload photographs of my relatives to the cloud, as services acquire a fingerprint (hash) of their faces and extract metadata from the uploaded JPEGs. …and I’m not hiding from anyone; I just don’t like the unremovable (unforgettable) traces we leave here.

    translates between arbitrary languages on the fly

    That’s what Firefox has built into the browser now. :D Desktop version, that is…

    hallucinates less and gets adapters for specific tasks and multimodal capabilities

    People will have less time to talk to other people because they’ll exchange pics with their favorite agent. xd

    And that’s where I expect their gifts to stop. I will still have my chatbot / AI companion.

    There are already services that charge for ML tasks. “You want a calendar notification from AI?” - pay more.

    “You want to summarize your daily emails” - pay double, save more.

    “You want to talk to your friend, who is asleep.” - talk to a virtual AI character, that looks and sounds like your friend. It even remembers your past conversations! /s


    I was concerned that a large dataset with low sentence similarity might take longer to train on. I’m not sure whether my idea that novels take less time to train on than a Q&A dataset with detailed answers is true: generic roleplay vs. encyclopedic knowledge.

    Reading these datasets, I think the GPT-3/4 conversations go into too much detail, and current (1-40B) language models cannot be trained to that level of detail. These conversations would only be useful for humans. But I might be wrong about the training, because I don’t have experience with 100B+ models or how they scale down.




  • He said to let him prepare himself first 🤣

    Chat is a friendly man, age 21, who talks to the user, please roleplay.
    
    USER: Hi there!
    
    CHAT: Chat the friendly man Sure, I'd be happy to roleplay as Chat, the friendly man. Just give me a moment to prepare myself for our conversation. 😊
    
    Hello! So, nice to meet you! My name is Chat, and I'm really excited to talk with you today. What's your name, and tell me a little bit about yourself?