LlamaPi Update - Llama-3.2 3B
Ping Zhou, 2024-10-01
I just updated LlamaPi with Llama-3.2 3B as its default local LLM.
As with Llama-3.1, I needed to convert the model into GGUF format and then quantize it to different sizes. With 5-bit quantization, memory usage dropped to ~2.7GB and generation speed reached 3.3 tokens/second, a 1.83x speedup over Llama-3.1 8B (4-bit quantized).
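For reference, here is roughly how the conversion and quantization go with llama.cpp's stock tools. This is a sketch: the checkpoint directory and output file names are my own, and I'm assuming Q5_K_M and Q8_0 as the 5-bit and 8-bit variants.

```bash
# Convert the Hugging Face checkpoint to GGUF (paths are illustrative)
python convert_hf_to_gguf.py ./Llama-3.2-3B-Instruct --outfile llama-3.2-3b-f16.gguf

# Quantize to 5-bit (Q5_K_M) and 8-bit (Q8_0) variants
./llama-quantize llama-3.2-3b-f16.gguf llama-3.2-3b-q5_k_m.gguf Q5_K_M
./llama-quantize llama-3.2-3b-f16.gguf llama-3.2-3b-q8_0.gguf Q8_0
```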
Here is a comparison using the llama.cpp CLI:
| Model | tokens/second |
|---|---|
| Llama-3.1 8B (4-bit quantized) | 1.8 |
| Llama-3.2 3B (8-bit quantized) | 2.5 |
| Llama-3.2 3B (5-bit quantized) | 3.3 |
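If you want to reproduce these numbers, llama.cpp's bundled llama-bench tool is a convenient way to measure throughput; a minimal invocation might look like this (the model file name is assumed from the quantization step above):

```bash
# Measure prompt-processing and generation speed for the quantized model
./llama-bench -m llama-3.2-3b-q5_k_m.gguf
```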
Generation quality seems to be similar to that of Llama-3.1 8B, but I haven't had time to compare them extensively yet.
3.3 tokens/second is still a fair way from my target of 10 tokens/second, but it's definitely a good start. Maybe I should spend some time getting Vulkan working?
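If I do try Vulkan, llama.cpp has a Vulkan backend that can be enabled at build time; a sketch, assuming the GGML_VULKAN CMake option and a working Vulkan driver on the Pi:

```bash
# Rebuild llama.cpp with the Vulkan backend enabled
# (assumes Vulkan headers/loader are installed, e.g. libvulkan-dev)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
```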