LlamaPi: Experiments with VideoCore GPU
Ping Zhou, 2024-11-16
As mentioned in my previous posts, I’ve been trying to get LlamaPi to run with VideoCore GPU on Raspberry Pi, hoping to further boost generation speed.
Well, that effort might have just come to a conclusion… TL;DR: the VideoCore GPU on the Raspberry Pi is not well suited for this kind of computation - in fact, it is much slower than the Pi's ARM CPU cores.
In case I need them again, here are some records of my experiments.
llama.cpp on VideoCore (through Vulkan)
Build llama.cpp with debug options:
cmake -B build -DGGML_VULKAN=ON -DGGML_VULKAN_DEBUG=ON -DGGML_VULKAN_SHADER_DEBUG_INFO=ON -DGGML_VULKAN_RUN_TESTS=ON
Run llama.cpp with runtime debug options. I ran this on a Raspberry Pi 4 with 4GB RAM, so I used TinyLlama and only offloaded 1 layer to the GPU.
./build/bin/llama-cli -m ~/llm_models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello" -n 8 -t 1 -b 1 -c 12 -ngl 1
After a long time, it crashed with something like this:
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x556746a600, 1)
ggml_vk_submit(0x556746a600, 0x5565a23820)
TEST F16_F32_ALIGNED_L m=32000 n=512 k=4096 batch=2 split_k=1 matmul 978352ms 0.000274375 TFLOPS avg_err=0
ggml_vk_queue_cleanup()
ggml_vk_queue_cleanup()
~vk_buffer_struct(0x5565c46c70, 524288000)
~vk_buffer_struct(0x5565146c60, 16777216)
~vk_buffer_struct(0x5565c46b20, 131072000)
ggml_pipeline_cleanup(matmul_f16_f32_aligned_l)
ggml_pipeline_cleanup(split_k_reduce)
/home/rpi/git/llama.cpp/ggml/src/ggml-vulkan.cpp:5701: fatal error
Looking at the code, it turned out that the GGML_VULKAN_RUN_TESTS option caused this crash:
GGML_ABORT("fatal error");
I then disabled the GGML_VULKAN_RUN_TESTS option along with the other debug options, and llama.cpp was able to run.
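For reference, the clean rebuild is just the standard llama.cpp Vulkan build - something like this (the -j value is only my guess at a reasonable parallelism for the Pi):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j 4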
# offload only 1 layer to the GPU (ngl is 1)
./build/bin/llama-cli -m ~/llm_models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello" -n 8 -t 1 -b 1 -c 12 -ngl 1
However, it generated garbage data at a very low speed:
Hellooure freoglivekionooureƔnicodocumentclass

llama_perf_sampler_print: sampling time = 1.13 ms / 10 runs (0.11 ms per token, 8833.92 tokens per second)
llama_perf_context_print: load time = 1596.87 ms
llama_perf_context_print: prompt eval time = 1814.35 ms / 2 tokens (907.17 ms per token, 1.10 tokens per second)
llama_perf_context_print: eval time = 6431.51 ms / 7 runs (918.79 ms per token, 1.09 tokens per second)
llama_perf_context_print: total time = 8257.32 ms / 9 tokens
If I disabled GPU usage (by setting the ngl argument to 0), it generated normal text at a much higher speed:
./build/bin/llama-cli -m ~/llm_models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello" -n 8 -t 1 -b 1 -c 12 -ngl 0
Hello, world! [end of text]

llama_perf_sampler_print: sampling time = 0.41 ms / 6 runs (0.07 ms per token, 14527.85 tokens per second)
llama_perf_context_print: load time = 990.88 ms
llama_perf_context_print: prompt eval time = 955.18 ms / 2 tokens (477.59 ms per token, 2.09 tokens per second)
llama_perf_context_print: eval time = 1438.12 ms / 3 runs (479.37 ms per token, 2.09 tokens per second)
llama_perf_context_print: total time = 2398.69 ms / 5 tokens
As you can see, offloading even a single layer to the GPU cuts the generation speed almost in half!
I tried different ngl arguments (1, 2, 3), and generation got slower as more layers were offloaded to the GPU:
| # of layers offloaded (ngl) | TPS |
|---|---|
| 0 | 2.09 |
| 1 | 1.10 |
| 2 | 0.73 |
| 3 | 0.56 |
This made me doubt whether I should offload computation to the VideoCore GPU at all…
Performance tests of VideoCore
Fortunately, there is a nice tool for comparing the compute performance of the VideoCore GPU against the CPU: https://github.com/Idein/py-videocore6
It’s a Python package that communicates directly with the V3D hardware (/dev/dri/card0). After installation, I ran the SGEMM example program:
==== sgemm example (1024x1024 times 1024x1024) ====
numpy: 0.1013 sec, 21.23 Gflop/s
QPU:   0.5935 sec, 3.624 Gflop/s
Minimum absolute error: 0.0
Maximum absolute error: 0.0003814697265625
Minimum relative error: 0.0
Maximum relative error: 0.13134673237800598
SGEMM is ~5.86x faster on the CPU (numpy) than on the VideoCore GPU (QPU).
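For reproducibility, installing the package and running the example went roughly like this (the examples/sgemm.py path is from memory - check the repo's README for the exact steps):
git clone https://github.com/Idein/py-videocore6
cd py-videocore6
pip3 install --user .
python3 examples/sgemm.py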
These experiments were done on a Raspberry Pi 4. How about the Raspberry Pi 5 (with its VideoCore 7)? It doesn't look good there either, according to this repo: https://github.com/Towdo/py-videocore7
According to the author:
Disclaimer: I’m currently not actively working on this anymore since there doesn’t seem to be any performance to be gained over the CPU. In fact, it seems to be challenging enough to just beat a single core of the CPU using the whole GPU in any real world task.
Conclusion
As far as I can tell, the Raspberry Pi's VideoCore GPU does not accelerate llama.cpp generation (not to mention the garbage output, which could be an issue in the GGML/Vulkan implementation). In fact, it is much slower than the CPU. A little sad, but it seems to be true…
Appendix
NOTE: the VideoCore revision on my Raspberry Pi 4 seems to be 4.2:
cat /sys/kernel/debug/dri/0/v3d_ident
Revision:   4.2.14.0
MMU:        yes
TFU:        yes
TSY:        yes
MSO:        yes
L3C:        no (0kb)
Core 0:
  Revision:     4.2
  Slices:       2
  TMUs:         2
  QPUs:         8
  Semaphores:   0
  BCG int:      0
Is this correct? I thought Raspberry Pi 4 should be using VideoCore 6?
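On a related note, another way to see which device and driver the Vulkan loader reports (which is what llama.cpp's Vulkan backend enumerates) is vulkaninfo from the vulkan-tools package; the grep pattern below is just an example:
sudo apt install vulkan-tools
vulkaninfo | grep -i -E "deviceName|driverName"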
Another related piece of configuration is the boot config file (/boot/firmware/config.txt).
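The part that matters here is the KMS overlay that enables the V3D/DRM driver (and thus /dev/dri/card*). On a stock Raspberry Pi OS image it is enabled with lines like the following (defaults may differ by OS version):
# enable the full KMS display/3D driver
dtoverlay=vc4-kms-v3d
max_framebuffers=2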