
Ping's Tech Notes

LlamaPi Robot - Voice chatbot with LLM and robot arm

Ping Zhou, 2024-09-20

Intro

Recently I built a prototype demonstrating the possibilities of Voice + LLM + Robotics. It is a voice chatbot running on Raspberry Pi 5 backed by the latest LLM (e.g. Llama-3.1), allowing the user to control robot arm gestures through voice interactions.

  • Backed by local LLM (Llama-3.1 8B) or cloud-based LLM.
  • Local ASR (faster_whisper) and TTS (piper).
  • Robot arm commands generated by LLM based on the context of the conversation.

The prototype won the 1st and 3rd prizes at the recent InfiniEdge AI Hackathon.

Project code available on GitHub: https://github.com/zhoupingjay/LlamaPi

[Figure: LlamaPi demo (LlamaPi_demo.png)]

System Setup

Hardware:

  • Raspberry Pi 5 (8GB RAM)
  • Robot arm
  • Microphone and speaker for voice interaction

Software:

  • Raspbian OS (Debian 12) desktop

Install

Dependencies

Create a virtual environment:

mkdir ~/.virtualenvs/
python3 -m venv ~/.virtualenvs/llamapi
source ~/.virtualenvs/llamapi/bin/activate

Install dependencies:

sudo apt install portaudio19-dev
sudo apt install libopenblas-dev libopenblas-pthread-dev libopenblas-openmp-dev libopenblas0 libopenblas0-pthread libopenblas0-openmp
sudo apt install libopenblas64-0 libopenblas64-dev libopenblas64-pthread-dev libopenblas64-openmp-dev
sudo apt install ccache build-essential cmake

Install Python modules:

pip install pyaudio wave soundfile
pip install faster_whisper numpy

# RPi.GPIO doesn't work on Raspberry Pi 5; use rpi-lgpio as a drop-in replacement
pip uninstall RPi.GPIO
pip install rpi-lgpio

pip install opencc
pip install smbus

# For use with OpenBLAS:
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install 'llama-cpp-python[server]'
pip install openai

Check out the demo code:

git clone https://github.com/zhoupingjay/LlamaPi.git

Setup the LLM

Local LLM (Llama-3.1 8B Instruct)

  • Quantize the model to 4-bit so it can fit in the 8GB RAM on Raspberry Pi. You may use the quantization tool from llama.cpp, or use the GGUF-my-repo space on Hugging Face.
  • Create a llm folder under LlamaPi, and download the 4-bit quantized model (.gguf file) under this folder. E.g. llm/meta-llama-3.1-8b-instruct-q4_k_m.gguf.
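
A minimal sketch of loading the quantized model and generating a reply with llama-cpp-python (the context size, thread count, and prompt below are just placeholder assumptions; see LlamaPi.py for the actual code):

from llama_cpp import Llama

# Load the 4-bit quantized Llama-3.1 8B Instruct model.
llm = Llama(
    model_path="llm/meta-llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=2048,    # context window
    n_threads=4,   # Raspberry Pi 5 has 4 cores
)

messages = [
    {"role": "system", "content": "You are LlamaPi, a friendly robot assistant. Keep replies short."},
    {"role": "user", "content": "Hello!"},
]
out = llm.create_chat_completion(messages=messages, max_tokens=128)
print(out["choices"][0]["message"]["content"])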

Cloud-based LLM

  • I also built a demo using a cloud-based LLM from Coze. Since Coze does not provide OpenAI-compatible APIs, I developed a simple wrapper for it.
  • To use the cloud-based LLM, you need to set up a bot in Coze. The prompt can be similar to the one used by the local LLM, but it can also be more sophisticated since the cloud-based LLM is much more powerful.
  • Check out coze_demo.py for more details.

ASR

Use faster_whisper (https://github.com/SYSTRAN/faster-whisper) installed from pip. It downloads the ASR model on the first run; subsequent runs are fully local.
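
A minimal sketch of the ASR step, assuming a base.en model and a recording.wav file captured from the microphone (the actual script wires this into the push-to-talk flow):

from faster_whisper import WhisperModel

# int8 compute keeps the memory footprint small on the Pi
model = WhisperModel("base.en", device="cpu", compute_type="int8")

segments, info = model.transcribe("recording.wav", beam_size=5)
text = "".join(segment.text for segment in segments)
print(text)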

TTS

  • Use piper for TTS.
  • Create a tts folder under LlamaPi, then download and extract piper into this folder:
cd tts
wget https://github.com/rhasspy/piper/releases/download/v1.2.0/piper_arm64.tar.gz
tar zxf piper_arm64.tar.gz

Also download a voice for piper (the .onnx model and its .onnx.json config) into a tts/voices folder. The resulting directory structure would look like this:

LlamaPi
├── llm
│   └── (model .gguf file)
└── tts
    ├── piper
    │   └── (piper binaries)
    └── voices
        └── (voice files)
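
For reference, a minimal sketch of driving piper from Python, assuming a voice named en_US-lessac-medium.onnx under tts/voices (use whatever voice you downloaded; piper expects the matching .onnx.json next to it) and aplay for playback:

import subprocess

def speak(text: str, wav_path: str = "/tmp/llamapi_tts.wav") -> None:
    # Pipe the text into the piper binary, then play the generated WAV.
    subprocess.run(
        ["tts/piper/piper",
         "--model", "tts/voices/en_US-lessac-medium.onnx",
         "--output_file", wav_path],
        input=text.encode("utf-8"),
        check=True,
    )
    subprocess.run(["aplay", wav_path], check=True)

speak("Hello, I am LlamaPi.")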

Usage

In your virtual environment, run the LlamaPi.py script:

python LlamaPi.py

You’ll see a window with a big blue button and a text box showing the conversation.

Or if you run the cloud-based demo:

python coze_demo.py

(You need to set the environment variables COZE_APIKEY and COZE_BOTID to run this demo.)

The robot uses a “push-to-talk” mode for interaction: Hold the button, talk, and release the button after you finish. The robot will respond with text and voice.
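
For illustration, here is a standalone sketch of the audio-capture step with pyaudio. In LlamaPi.py the recording is tied to the GUI button; this sketch simply records for a fixed five seconds and writes a WAV file that can be fed to faster_whisper:

import wave
import pyaudio

RATE, CHUNK, SECONDS = 16000, 1024, 5

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

# Capture roughly SECONDS of audio in CHUNK-sized reads.
frames = [stream.read(CHUNK, exception_on_overflow=False)
          for _ in range(RATE * SECONDS // CHUNK)]

stream.stop_stream()
stream.close()
pa.terminate()

with wave.open("recording.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))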

The robot will also generate simple robot arm commands based on the context of your conversation:

  • If you say hello to the robot, it will generate a $greet command;
  • If you sound happy, it will generate a $smile command;
  • If you sound negative, it will generate a $pat command;
  • If you ask the robot to hand you something, it will generate a $retrieve command to emulate the action of retrieving an object.

These simple commands will result in different gestures from the robot arm.
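
A hedged sketch of this command handling: extract a $command token from the LLM reply, speak the remaining text, and dispatch the command to the arm. The gesture function below is a hypothetical placeholder, not the actual LlamaPi arm driver:

import re

ARM_COMMANDS = {"greet", "smile", "pat", "retrieve"}

def split_reply(reply: str):
    """Return (spoken_text, command or None) from an LLM reply."""
    match = re.search(r"\$(\w+)", reply)
    command = match.group(1) if match and match.group(1) in ARM_COMMANDS else None
    spoken = re.sub(r"\$\w+", "", reply).strip()
    return spoken, command

def run_gesture(command: str) -> None:
    # Placeholder: the real arm gestures are driven over GPIO/I2C.
    print(f"[arm] performing gesture: {command}")

text, cmd = split_reply("Hello there, nice to meet you! $greet")
print(text)           # -> "Hello there, nice to meet you!"
if cmd:
    run_gesture(cmd)  # -> "[arm] performing gesture: greet"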

Challenges and Takeaways

The biggest challenge is the performance of running an 8B model on a low-power edge device like Raspberry Pi. With 4-bit quantization, I was able to fit Llama-3.1 8B on the device, but the generation speed was about 1.8 tokens/second (using llama.cpp + OpenBLAS).

Several techniques (or tricks) were used to mitigate the impact on user experience, e.g.:

  • Limit the length of the system prompt and the responses. The downside is that the local LLM version cannot use more sophisticated prompts.
  • I also temporarily disabled the conversation history in the local LLM version to reduce the prefill time.
  • Use streaming mode in generation, and detect “end of sentence” on the fly. Once a sentence is finished, I call TTS immediately to speak to the user. This makes the robot sound more responsive than waiting for the entire generation to finish (see the sketch after this list).
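
Here is a sketch of that streaming trick, assuming a llama_cpp.Llama instance and a speak() helper like the piper sketch above; the end-of-sentence detection is deliberately naive:

def stream_and_speak(llm, messages, speak):
    """Generate a reply in streaming mode and speak each finished sentence."""
    buffer = ""
    for chunk in llm.create_chat_completion(messages=messages,
                                            max_tokens=128, stream=True):
        delta = chunk["choices"][0]["delta"]
        buffer += delta.get("content", "")
        # Naive end-of-sentence detection on ., ! and ?
        while any(p in buffer for p in ".!?"):
            idx = min(buffer.index(p) for p in ".!?" if p in buffer)
            sentence, buffer = buffer[:idx + 1], buffer[idx + 1:]
            if sentence.strip():
                speak(sentence.strip())
    if buffer.strip():
        speak(buffer.strip())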

However, I need a more fundamental solution to the performance issue of running the LLM locally on Raspberry Pi. My target is to achieve 10 tokens/second.

Leveraging the VideoCore GPU on Raspberry Pi 5

Raspberry Pi 5 has a VideoCore GPU that supports Vulkan. llama.cpp/ggml also has a Vulkan backend (ggml_vulkan.cpp), making this a (seemingly) viable option.

Install the Vulkan packages:

sudo apt install libgulkan-0.15-0 libgulkan-dev vulkan-tools libvulkan-dev libvkfft-dev libgulkan-utils glslc

To compile llama.cpp:

cmake -B build -DGGML_VULKAN=ON

Or if using llama-cpp-python binding:

CMAKE_ARGS="-DGGML_VULKAN=ON" pip install 'llama-cpp-python'
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install 'llama-cpp-python[server]'

The VideoCore GPU on Raspberry Pi does not have enough memory for the entire model, but I can offload some of the layers to the GPU using the -ngl argument. From my experiments, offloading 20 layers (out of 32) could pass initialization without an OOM error.
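
For reference, the same partial offload can be expressed through the llama-cpp-python binding (a sketch; n_gpu_layers corresponds to -ngl):

from llama_cpp import Llama

llm = Llama(
    model_path="llm/meta-llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=512,
    n_threads=4,
    n_gpu_layers=20,  # offload 20 of the 32 layers to the VideoCore GPU
)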

Unfortunately, llama.cpp got stuck running the model. After some research, I tried disabling loop unrolling (by setting the V3D_DEBUG environment variable), and it seemed to go through:

V3D_DEBUG=noloopunroll ./build/bin/llama-cli -m <model.gguf> -p "Hello" -n 50 -ngl 20 -t 4 -c 512

However, the model generated corrupted output, and it was even slower than the CPU (probably because loop unrolling was disabled). :-(

From some research, my hypothesis is that it might have something to do with the shaders, which assume a warp (subgroup) size of 32 or 64, while the Raspberry Pi GPU uses 16.

So far I haven’t had time to look further into this. Vulkan is new to me, so debugging this issue will be a bit challenging (and fun too!). Any advice would be appreciated.

Optimize CPU inference with LUT?

The idea is inspired by the recent T-MAC paper, which uses LUTs (look-up tables) to replace arithmetic ops. This could be especially useful for low-bit quantized models. E.g., consider a multiplication between an 8-bit number and a 4-bit number: if we pre-compute all possible combinations and save the results in a 256x16 table, the multiplication can be replaced by memory lookups.
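
A toy numpy illustration of the idea (not the actual T-MAC kernel): pre-compute the 256x16 table of int8 times unsigned 4-bit products, then compute a dot product with lookups only:

import numpy as np

# lut[a + 128, w] == a * w for every int8 activation a and 4-bit weight w
a_vals = np.arange(-128, 128, dtype=np.int32)   # all 256 int8 values
w_vals = np.arange(0, 16, dtype=np.int32)       # all 16 unsigned 4-bit values
lut = np.outer(a_vals, w_vals)                  # shape (256, 16)

def dot_lut(activations: np.ndarray, weights4: np.ndarray) -> int:
    """Dot product of int8 activations and 4-bit weights via table lookups."""
    return int(lut[activations.astype(np.int32) + 128, weights4].sum())

acts = np.random.randint(-128, 128, size=1024, dtype=np.int8)
wts = np.random.randint(0, 16, size=1024, dtype=np.int8)
assert dot_lut(acts, wts) == int((acts.astype(np.int32) * wts).sum())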

I think this LUT idea makes a lot of sense and might achieve a significant performance boost on Raspberry Pi. In fact, the T-MAC paper already showed some promising results on Raspberry Pi 5. Adopting this idea in my project is a direction I’d be interested in looking into.

Further quantize the model to 2-bit?

I’d rather not do this… If you look at the help page of llama.cpp’s quantization tool, you’ll see that Q2 adds a lot more perplexity (ppl) than Q4.

 2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
 3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
 8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
 9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
......
10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B

Conclusion and Future Work

Despite the challenges, the project successfully demonstrated the potential of Voice + LLM + Robotics on a low-power edge device. Lots of work still needs to be done to unleash the performance of the Raspberry Pi. My target is to achieve 10 tokens/second with the 8B model. If you have any ideas or suggestions, please let me know!