LlamaPi: Experiments with VideoCore GPU
Ping Zhou, 2024-11-16
As mentioned in my previous posts, I’ve been trying to get LlamaPi to run with VideoCore GPU on Raspberry Pi, hoping to further boost generation speed.
Well, that effort might have just come to a conclusion… TL;DR: the VideoCore GPU on the Raspberry Pi is not well suited for this kind of computation - in fact, it is much slower than the Pi's ARM CPU cores.
In case I need them again, here are some records of my experiments.
llama.cpp on VideoCore (through Vulkan)
Build llama.cpp with debug options:
cmake -B build -DGGML_VULKAN=ON -DGGML_VULKAN_DEBUG=ON -DGGML_VULKAN_SHADER_DEBUG_INFO=ON -DGGML_VULKAN_RUN_TESTS=ON
Run llama.cpp with runtime debug options. I ran this on a Raspberry Pi 4 with 4GB RAM, so I used TinyLlama and only offloaded 1 layer to the GPU.
./build/bin/llama-cli -m ~/llm_models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello" -n 8 -t 1 -b 1 -c 12 -ngl 1
After a long time, it crashed with something like this:
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x556746a600, 1)
ggml_vk_submit(0x556746a600, 0x5565a23820)
TEST F16_F32_ALIGNED_L m=32000 n=512 k=4096 batch=2 split_k=1 matmul 978352ms 0.000274375 TFLOPS avg_err=0
ggml_vk_queue_cleanup()
ggml_vk_queue_cleanup()
~vk_buffer_struct(0x5565c46c70, 524288000)
~vk_buffer_struct(0x5565146c60, 16777216)
~vk_buffer_struct(0x5565c46b20, 131072000)
ggml_pipeline_cleanup(matmul_f16_f32_aligned_l)
ggml_pipeline_cleanup(split_k_reduce)
/home/rpi/git/llama.cpp/ggml/src/ggml-vulkan.cpp:5701: fatal error
Looking at the code, it turned out that the GGML_VULKAN_RUN_TESTS option caused this crash:
GGML_ABORT("fatal error");
I then disabled the GGML_VULKAN_RUN_TESTS option along with the other debug options, and llama.cpp was able to run.
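For reference, the clean rebuild is just the standard llama.cpp Vulkan build - something like this (the -j value is only my guess at a reasonable parallelism for the Pi):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j 4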
# offload only 1 layer to the GPU (ngl is 1)
./build/bin/llama-cli -m ~/llm_models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello" -n 8 -t 1 -b 1 -c 12 -ngl 1
However, it generated garbage data at a very low speed:
Hellooure freoglivekionooureƔnicodocumentclass

llama_perf_sampler_print: sampling time = 1.13 ms / 10 runs (0.11 ms per token, 8833.92 tokens per second)
llama_perf_context_print: load time = 1596.87 ms
llama_perf_context_print: prompt eval time = 1814.35 ms / 2 tokens (907.17 ms per token, 1.10 tokens per second)
llama_perf_context_print: eval time = 6431.51 ms / 7 runs (918.79 ms per token, 1.09 tokens per second)
llama_perf_context_print: total time = 8257.32 ms / 9 tokens
If I disabled GPU usage (by setting the ngl argument to 0), it generated normal text at a much higher speed:
./build/bin/llama-cli -m ~/llm_models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello" -n 8 -t 1 -b 1 -c 12 -ngl 0
Hello, world! [end of text]

llama_perf_sampler_print: sampling time = 0.41 ms / 6 runs (0.07 ms per token, 14527.85 tokens per second)
llama_perf_context_print: load time = 990.88 ms
llama_perf_context_print: prompt eval time = 955.18 ms / 2 tokens (477.59 ms per token, 2.09 tokens per second)
llama_perf_context_print: eval time = 1438.12 ms / 3 runs (479.37 ms per token, 2.09 tokens per second)
llama_perf_context_print: total time = 2398.69 ms / 5 tokens
As you can see, offloading even a single layer to the GPU cuts the generation speed almost in half!
I tried different ngl arguments (1, 2, 3), and generation got slower as more layers were offloaded to the GPU:
| # of layers offloaded (ngl) | TPS |
|---|---|
| 0 | 2.09 |
| 1 | 1.10 |
| 2 | 0.73 |
| 3 | 0.56 |
This made me doubt whether I should offload computation to the VideoCore GPU at all…
Performance tests of VideoCore
Fortunately, there is a nice tool for comparing the compute performance of the VideoCore GPU against the CPU: https://github.com/Idein/py-videocore6
It’s a Python package that communicates directly with the V3D hardware (/dev/dri/card0). After installation, I ran the SGEMM example program:
==== sgemm example (1024x1024 times 1024x1024) ====
numpy: 0.1013 sec, 21.23 Gflop/s
QPU:   0.5935 sec, 3.624 Gflop/s
Minimum absolute error: 0.0
Maximum absolute error: 0.0003814697265625
Minimum relative error: 0.0
Maximum relative error: 0.13134673237800598
SGEMM is ~5.86x faster on the CPU (numpy) than on the VideoCore GPU (QPU).
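For reproducibility, installing the package and running the example went roughly like this (the examples/sgemm.py path is from memory - check the repo's README for the exact steps):
git clone https://github.com/Idein/py-videocore6
cd py-videocore6
pip3 install --user .
python3 examples/sgemm.py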
These experiments were done on a Raspberry Pi 4. How about the Raspberry Pi 5 (with its VideoCore 7)? It doesn't look good there either, according to this repo: https://github.com/Towdo/py-videocore7
According to the author:
Disclaimer: I’m currently not actively working on this anymore since there doesn’t seem to be any performance to be gained over the CPU. In fact, it seems to be challenging enough to just beat a single core of the CPU using the whole GPU in any real world task.
Conclusion
As far as I can tell, the Raspberry Pi's VideoCore GPU does not accelerate llama.cpp generation (not to mention the garbage output, which could be an issue in the GGML/Vulkan implementation). In fact, it is much slower than the CPU. A little sad, but it seems to be true…
Appendix
NOTE: the VideoCore revision on my Raspberry Pi 4 seems to be 4.2:
cat /sys/kernel/debug/dri/0/v3d_ident
Revision:   4.2.14.0
MMU:        yes
TFU:        yes
TSY:        yes
MSO:        yes
L3C:        no (0kb)
Core 0:
  Revision:     4.2
  Slices:       2
  TMUs:         2
  QPUs:         8
  Semaphores:   0
  BCG int:      0
Is this correct? I thought Raspberry Pi 4 should be using VideoCore 6?
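On a related note, another way to see which device and driver the Vulkan loader reports (which is what llama.cpp's Vulkan backend enumerates) is vulkaninfo from the vulkan-tools package; the grep pattern below is just an example:
sudo apt install vulkan-tools
vulkaninfo | grep -i -E "deviceName|driverName"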
Another related piece of configuration is the boot config file (/boot/firmware/config.txt).
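The part that matters here is the KMS overlay that enables the V3D/DRM driver (and thus /dev/dri/card*). On a stock Raspberry Pi OS image it is enabled with lines like the following (defaults may differ by OS version):
# enable the full KMS display/3D driver
dtoverlay=vc4-kms-v3d
max_framebuffers=2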