What should I use: big model-small quant or small model-no quant?

Smorty [she/her]@lemmy.blahaj.zone · 2 days ago

SGforce@lemmy.ca · 2 days ago

With modern methods sometimes running a larger model split between GPU/CPU can be fast enough. Here’s an example https://dev.to/maximsaplin/llamacpp-cpu-vs-gpu-shared-vram-and-inference-speed-3jpl

Smorty [she/her]@lemmy.blahaj.zone · 2 days ago

oooh a windows only feature, now I see why I haven’t heard of this yet. Well, too bad I guess. It’s time to switch to AMD for me anyway…

SGforce@lemmy.ca · 2 days ago

Oh, that part is. But the splitting tech is built into llama.cpp