LlamaPi Update - Llama-3.2 3B
Ping Zhou, 2024-10-01
I just updated LlamaPi with Llama-3.2 3B as its default local LLM.
As with Llama-3.1, I needed to convert the model into GGUF format and then quantize it to different sizes. With 5-bit quantization, memory usage dropped to ~2.7GB and generation speed reached 3.3 tokens/second, a 1.83x speedup over Llama-3.1 8B (4-bit quantized).
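For reference, here is roughly how the conversion and quantization go with llama.cpp's stock tools. This is a sketch: the checkpoint directory and output file names are my own, and I'm assuming Q5_K_M and Q8_0 as the 5-bit and 8-bit variants.

```bash
# Convert the Hugging Face checkpoint to GGUF (paths are illustrative)
python convert_hf_to_gguf.py ./Llama-3.2-3B-Instruct --outfile llama-3.2-3b-f16.gguf

# Quantize to 5-bit (Q5_K_M) and 8-bit (Q8_0) variants
./llama-quantize llama-3.2-3b-f16.gguf llama-3.2-3b-q5_k_m.gguf Q5_K_M
./llama-quantize llama-3.2-3b-f16.gguf llama-3.2-3b-q8_0.gguf Q8_0
```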
Here is a comparison using the llama.cpp CLI:
| Model | tokens/second |
|---|---|
| Llama-3.1 8B (4-bit quantized) | 1.8 |
| Llama-3.2 3B (8-bit quantized) | 2.5 |
| Llama-3.2 3B (5-bit quantized) | 3.3 |
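If you want to reproduce these numbers, llama.cpp's bundled llama-bench tool is a convenient way to measure throughput; a minimal invocation might look like this (the model file name is assumed from the quantization step above):

```bash
# Measure prompt-processing and generation speed for the quantized model
./llama-bench -m llama-3.2-3b-q5_k_m.gguf
```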
Generation quality seems to be similar to that of Llama-3.1 8B, but I haven't had time to compare them extensively yet.
3.3 tokens/second is still a fair way from my target of 10 tokens/second, but it's definitely a good start. Maybe I should spend some time getting Vulkan working?
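If I do try Vulkan, llama.cpp has a Vulkan backend that can be enabled at build time; a sketch, assuming the GGML_VULKAN CMake option and a working Vulkan driver on the Pi:

```bash
# Rebuild llama.cpp with the Vulkan backend enabled
# (assumes Vulkan headers/loader are installed, e.g. libvulkan-dev)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
```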