Tech Notes

基于 Emacs org-mode 和 hugo 的 Blog 工作流

Introduction 我经常使用 Emacs Org-mode 来写blog。虽然有很多其他笔记工具和方案，但我一直觉得没法取代 Org-mode，比如Org-mode里的 Babel 代码块，内嵌LaTex公式，GUI下图表预览，表格自动求值等功能，实在是很方便。Emacs Org-mode 也天然支持将众多 org 文件笔记通过链接组织成个人知识库。不过，现有的工作流基于标准的 ox-publish 导出 HTML，样式略显陈旧，如果支持更美观的主题，需要很多手动配置。最近对orgmode工作进行了一次升级：引入 Hugo 渲染引擎（搭配 PaperMod 主题），同时保留了纯粹的 Org-mode 创作体验。工作流的演进过去：Org -> HTML (ox-publish) 优点：流程简单，完全受 Emacs 控制。缺点：网页样式极其基本，移动端适配差，缺乏搜索和标签功能。现在：Org -> HTML Fragment -> Hugo (PaperMod) 改进：采用 Hugo 作为渲染引擎，利用 PaperMod 提供现代化的 UI。核心逻辑：利用 Emacs 的 ox-html 将 Org 导出为纯净的 HTML 片段，再交给 Hugo 进行最终装配。关键技术点 1. 自动化导出脚本：publish-hugo.el 这是整个流程的“心脏”。它是一个 Emacs Lisp 脚本，通过 Org 内置的 ox-html 导出器实现精准渲染，同时负责元数据注入和路径修正。 (defun pz/export-all-to-hugo () (dolist (section '("notes" "quantum" "me")) (dolist (file (directory-files (expand-file-name section) t "\\.org$")) (with-current-buffer (find-file-noselect file) ;; 导出为 HTML 片段 (let ((html-content (org-export-as 'html nil nil t))) ;; 修复资源路径 (setq html-content (replace-regexp-in-string "\\.\\./images/" "/images/" html-content)) ;; 写入 Hugo content 目录 (with-temp-file dest-file (insert "---\n" (format "title: \"%s\"\n" title) "math: true\n---\n\n") (insert html-content))))))) 2. 构建与部署：build.sh & deploy.sh 实现“一键发布”的两个 Bash 脚本： ...

LlamaPi 初步试用 RWKV, Ollama

最近RWKV有不少进展，RWKV-7出了思考模型 rwkv-7-world-g1 ，RWKV-8也即将发布。 RWKV的模型架构，在计算量和内存上相比Transformer有很大的优势，对于LlamaPi这种在边缘设备上运行的应用很有吸引力。因此尝试在LlamaPi上适配RWKV，看看效果如何。目前初步尝试，总的来说不太成功，可能是使用方法或适配上还有些问题，模型始终不能很好的遵循系统指令，生成符合要求的响应。 llama.cpp运行按照文档步骤：编译最新的llama.cpp 下载 rwkv7-g1-1.5b-20250429-ctx4096 ，转换成gguf，Q80量化用 llama-cli 命令行运行，提示词参考LlamaPi： ./build/bin/llama-cli -m models/rwkv7-g1-1.5b-20250429-ctx4096-Q8_0.gguf -p "Your name is Skyler. You are a helpful assistant.\n\n" -cnv -t 4 -ngl 99 -n 500 启动后本来应该就进入对话的，但它先进入了自嗨模式，输出了一大通不相关的东西，然后才停下进入提示符。这可能还是因为我设了n=500的参数，否则不知道它什么时候会停下：然后检查一下提示词的有效性，结果是完全没用： > hello, what's your name? My name is [My name] > who are you? My name is [My name] > 命令行不行，那试试Web界面？ /build/bin/llama-server -m models/rwkv7-g1-1.5b-20250429-ctx4096-Q8_0.gguf 先设置一下系统提示：试一下，这次系统提示词似乎有效了。试试LlamaPi的完整提示词： ...

LlamaPi: Experiments with VideoCore GPU

As mentioned in my previous posts, I've been trying to get LlamaPi to run with VideoCore GPU on Raspberry Pi, hoping to further boost generation speed. Well, that effort might have just come to a conclusion… TL;DR is that VideoCore on Raspberry Pi is not well suited for such computation - in fact, it is even much slower than the ARM CPUs on Raspberry Pi. In case I need them again, here are some of the records of my experiments. ...

LlamaPi Update - Gemini Support

LlamaPi now supports Gemini as its backing LLM, in addition to local LLM and Coze. On Gemini side: Test the prompt (system instruction) for using with LlamaPi. Create an API key (may need to create a project first). On LlamaPi side: Create a wrapper using Gemini Python SDK. In order to support multiple backing LLMs, major refactor was done to the LlamaPi code. Create a common base class LlamaPiBase for all three scenarios (local, Coze, Gemini). The base class includes the common functionalities like the UI, audio, ASR, etc. Created subclasses for local LLM, Coze and Gemini respectively. They extend the base class and implement the logic for interacting with backing LLMs. With this refactor, it will be easier for me to add support for other LLMs in the furture.

LlamaPi Update - Llama-3.2 3B

I just updated LlamaPi with Llama-3.2 3B as its default local LLM. Similar to Llama-3.1, I need to convert the model into gguf format, and then quantize it into different sizes. For 5-bit quantization, memory usage was reduced to ~2.7GB and generation speed reached 3.3 tokens/second. Compared to Llama-3.1 8B (4-bit quantized), the speedup is 1.83x. Here is a comparison using llama.cpp CLI: tokens/second Llama-3.1 8B (4-bit quantized) 1.8 Llama-3.2 3B (8-bit quantized) 2.5 Llama-3.2 3B (5-bit quantized) 3.3 Generation quality seemed to be similar as Llama-3.1 8B, but I haven't had time to compare them extensively yet. ...

LlamaPi Robot - Voice chatbot with LLM and robot arm

Intro Recently I built a prototype demonstrating the possibilities of Voice + LLM + Robotics. It is a voice chatbot running on Raspberry Pi 5 backed by the latest LLM (e.g. Llama-3.1), allowing the user to control robot arm gestures through voice interactions. Backed by local LLM (Llama-3.1 8B) or cloud-based LLM. Local ASR (faster_whisper) and TTS (piper). Robot arm commands generated by LLM based on the context of the conversation. The prototype won 1st and 3nd prizes at the recent InfiniEdge AI Hackathon. ...

[论文解读] DistServe分离式架构优化大模型推理服务性能

这是一篇北大和UCSD合作的论文，主题是优化大模型推理服务的性能。在前文（RWKV）中提到过，对于大语言模型（LLM）来说，更大的挑战是推理。因为推理成本属于OpEx，用户使用越多，花费就越大。降低LLM的推理服务成本，是LLM应用在商业上可持续的关键之一。 LLM推理服务对性能的要求，也与训练有所不同。推理服务更注重响应的延迟，由于直接面向用户，响应延迟直接影响用户体验。例如对于聊天机器人类型的应用，需要尽快的开始回答，也就是说要尽快生成输出第一个token，之后的token，需要能跟上人的阅读速度；而对于代码生成类的应用，则需要更快的端到端生成速度，以支持实时的代码提示。目前的大模型推理服务系统，都是以吞吐（throughput）为标准来优化的，也就是单位时间服务的用户请求数（request per second, rps）。这个指标，也被用作推理成本优化的一个目标，因为更高的吞吐，意味着可以用更少的GPU时间服务更多的用户请求。作者提出，简单的用吞吐作为指标是不够的。在实际场景中，应用对推理服务有不同的质量目标（service level objectives, SLO），对于LLM推理服务而言，最重要的有两个SLO： TTFT (Time To First Token): LLM生成第一个token需要的时间 TPOT (Time Per Output Token): LLM生成两个token之间的平均延迟显然，TTFT表示LLM多久开始回答，而TPOT表示LLM的语速。作者认为，LLM推理服务的质量应该看Goodput，可以理解为“有效吞吐”。Goodput的意思是：对于每个分配的GPU，在满足SLO目标的前提下（例如 90% TTFT < 200ms，90% TPOT < 50ms），所能达到的最大吞吐。更高的Goodput，意味着每个请求的服务成本更低。 maximum request rate that can be served adhering to the SLO attainment goal (say, 90%) for each GPU provisioned – higher per-GPU goodput directly translates into lower cost per query. 更高的throughput，并不一定意味着更高的goodput。为什么生成第一个token会花费更久呢？因为在服务用户的时候，LLM需要把用户的历史对话作为上下文，加上用户当前的问题（请求），作为prompt去推理，因此生成第一个token需要计算很长的prompt。而后续的token，因为LLM推理普遍采用KV cache缓存前一次生成的中间结果，避免重复计算，每次生成token计算量要比第一个token小得多。关于KV cache，简单说几句：不要把它混淆为Key Value，这里的KV指的是Transformer的K，V矩阵。KV cache的原理是，自回归的LLM在生成时，每次生成的上下文，和上一次生成只差一个token，因此有大量计算是重复的。如果把上一次生成时的中间结果（主要是K，V矩阵）记下来，那么下一次生成的时候就可以避免重复计算，大大降低推理成本。这个技术目前已经是标配了，基本上所有的LLM推理都会用到它。 ...

谨慎使用C语言里的联合(union)和位域(bit field)！

对于内核、驱动、嵌入式系统等底层开发来说，C语言的bit field（位域）和union（联合）都是常用的特性。位域可以让我们在结构体中指定某些成员占多少位，这在同硬件打交道的时候特别有用。例如硬件要求某个32位的消息里，第31位是flag，其余是value，用位域定义的数据结构： typedef struct msg_t { uint32_t flag : 1; uint32_t value : 31; } msg_t; 在程序里可以直接操作结构体成员那样访问flag和value，而不用手动去对32位消息进行位操作，这些编译器都给我们做了。我们还可以加上联合（union），使得我们既可以访问里面的成员，也可以按照一个32位数访问整个消息： typedef struct msg_fields_t { uint32_t flag : 1; uint32_t value : 31; } msg_fields_t; typedef struct msg_t { union { uint32_t raw; msg_fields_t fields; }; } msg_t; union告诉编译器，raw和fields这两个成员在结构体里占用同样的内存地址。因为这两个成员都是32位，因此raw就是整个32位的消息，而通过fields可以访问该消息的flags和value。但是，在同时使用union和bit field的时候要注意，union和bit field如果互相套在一起，编译器产生的内存排列可能和你想的不一样！ ...

My Little Reflections on Optane

In early 2023, Intel announced the discontinuity of Optane products (including SSD and memory). While not quite surprising considering their business environment, it’s still a bit of disappointment to me. As we are closing to the end of 2023, I decided to take some time and write down some of my reflections on Optane’s journey. [Disclaimer] All contents & discussions in this article are just my personal opinions and do not represent any organization or institution. ...

三国GPT (SanGuo GPT) v0.1

Overview SanGuo GPT is a Large Language Model trained on 三国演义 (San1 Guo2 Yan3 Yi4, Romance of the Three Kindoms), an ancient Chinese novel based on historical events happened ~1800 years ago during the late Han dynasty. It is a Transformer-based model with about 13.778M parameters in current version. I created this project for learning and exploration purpose. I'd like to try out the LLM application building process end-to-end, including major steps like data ingestion & preprocessing, shuffling/sampling, model building & training, visualization, model checkpointing and model serving. I want to explore the idea of "书读千遍，其义自现" (something like "if you read a book a thousand times, the meaning and implications will emerge by itself"). This idea popped up when I chat with a friend, and I found it very interesting. What if I train the model with data from just one book and iterate many steps? How would the model behave after intensive training on a single book? I also plan to use this project as a vehicle for playing with other new ideas - stay tuned! ...