llama2.c Notes

For everything that could have been,
At least we took the ride,
There is no relief in bitterness,
Might as well let it die.

A purely computational project finally gets a C/C++ version.


Prologue

With Qwen3 and DeepSeek-R2 on the verge of release, the global competition among large language models (LLMs) has shifted from a race for parameter count to a deeper contest over efficiency, reasoning, and ecosystem synergy. Looking back at Meta's llama2, released in 2023, its open and commercially usable stance shook up the industry: it provided a technical blueprint for subsequent model innovation and, through its philosophy of open sharing, catalyzed the democratization of AI, spurring a series of Chinese LLMs such as Qwen. Studying llama2 is therefore well worth the effort.

Introducing llama2.c

llama2.c is a CPU inference implementation of llama2. Its main features are:

  • Pure C with no dependencies, simple and easy to build
  • Small codebase (~980 lines), concise and highly readable
  • Decent performance: multithreaded with OpenMP, unquantized llama2-7B runs at about 2 tok/s (7900X + 32 GB RAM)
  • Low inference memory footprint (<2 GB): the model weights are mmap'ed rather than copied, and are fully read-only
  • Inference and training in a single project

For someone like me who is more used to reading code than reading papers, it is an ideal entry point for learning llama2. This article does not cover the techniques of the embedding and sampling stages.

All code in this article is taken from commit 350e04f.
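For reference, building and running looks roughly like this (based on the repository README and usage text, so exact Makefile targets and flags may differ between commits): compile with make runomp to get the OpenMP-parallelized binary, then run something like ./run model.bin -n 256 -i "Once upon a time", where model.bin is a llama2 checkpoint exported into the repo's .bin format.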

Preview: Math Functions

rmsnorm

rmsnorm (Root Mean Square Layer Normalization) is a normalization technique for deep learning models. It simplifies the computation of traditional layer normalization (LayerNorm) while improving training efficiency and stability.

void rmsnorm(float* o, float* x, float* weight, int size) {
    // calculate sum of squares
    float ss = 0.0f;
    for (int j = 0; j < size; j++) {
        ss += x[j] * x[j];
    }
    ss /= size;
    ss += 1e-5f;
    ss = 1.0f / sqrtf(ss);
    // normalize and scale
    for (int j = 0; j < size; j++) {
        o[j] = weight[j] * (ss * x[j]);
    }
}

The rmsnorm implementation here directly follows torch.nn.RMSNorm. In the code, x and weight correspond to x and γ in the formula below, and the output y is written into the vector o.

y_i = \frac{x_i}{\mathrm{RMS}(x)} \gamma_i, \quad \text{where} \quad \mathrm{RMS}(x) = \sqrt{\epsilon + \frac{1}{n} \sum_{i=1}^{n} x_i^2}
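A minimal usage sketch (made-up values; it assumes the rmsnorm() above is compiled into the same file):

#include <math.h>
#include <stdio.h>

/* rmsnorm() as defined above */

int main(void) {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // input vector
    float g[4] = {1.0f, 1.0f, 1.0f, 1.0f};  // gamma weights; all ones = pure normalization
    float o[4];
    rmsnorm(o, x, g, 4);
    // RMS(x) = sqrt(30/4 + 1e-5) ≈ 2.7386, so o ≈ {0.365, 0.730, 1.095, 1.461}
    for (int i = 0; i < 4; i++) printf("%f\n", o[i]);
    return 0;
}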

softmax

The classic normalization layer, applied in place. To reduce precision loss in the exp calls, the implementation subtracts the maximum value first.

void softmax(float* x, int size) {
    // find max value (for numerical stability)
    float max_val = x[0];
    for (int i = 1; i < size; i++) {
        if (x[i] > max_val) {
            max_val = x[i];
        }
    }
    // exp and sum
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        x[i] = expf(x[i] - max_val);
        sum += x[i];
    }
    // normalize
    for (int i = 0; i < size; i++) {
        x[i] /= sum;
    }
}

\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)} = \frac{\exp(x_i - x_{max})}{\sum_j \exp(x_j - x_{max})}
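A minimal usage sketch (made-up values, assuming the softmax() above is in scope), mainly to illustrate that subtracting the maximum does not change the result:

#include <stdio.h>

/* softmax() as defined above */

int main(void) {
    float logits[3] = {1.0f, 2.0f, 3.0f};
    softmax(logits, 3);
    // ≈ {0.0900, 0.2447, 0.6652}; identical to computing exp(x_i)/Σexp(x_j) directly,
    // because exp(x_i - m)/Σexp(x_j - m) = exp(x_i)/Σexp(x_j) for any constant m
    for (int i = 0; i < 3; i++) printf("%f\n", logits[i]);
    return 0;
}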

matmul

What llama2.c implements is not a full matrix-matrix multiplication but the degenerate case of a matrix times a vector. This will be explained in detail in the kvcache section later; here we only show the code.

The code is parallelized with OpenMP.

void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // by far the most amount of time is spent inside this little function
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}

xout = Wx, \quad shape(W) = (d, n), \quad shape(x) = (n)
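A minimal usage sketch with made-up numbers, mainly to illustrate the row-major layout of w and the (n, d) argument order (it assumes the matmul() above is in scope):

#include <stdio.h>

/* matmul() as defined above */

int main(void) {
    // W has shape (d=2, n=3), stored row-major
    float W[6] = {
        1, 2, 3,   // row 0
        4, 5, 6    // row 1
    };
    float x[3] = {1, 1, 1};
    float y[2];
    matmul(y, x, W, 3, 2);          // y = W @ x
    printf("%f %f\n", y[0], y[1]);  // prints 6.000000 15.000000
    return 0;
}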

Feed Forward Network

A feedforward neural network (FNN) is a broad concept and the most basic neural network structure. Its characteristics are:

  • Information flows one way: data passes from the input layer → hidden layers → output layer and never flows backward.
  • No memory: each output depends only on the current input, not on any past state.

In llama2.c the FFN is built out of matmul: multiplying a vector by a weight matrix is one fully connected FFN layer.

The classic Transformer architecture from "Attention Is All You Need"

In the classic Transformer architecture, the left half is called the encoder and the right half the decoder. Words are the input (the "input" at the bottom left of the figure). The data first flows through the encoder, which produces a vector; that vector is fed into the middle of the decoder, and the decoder is then run autoregressively to generate words.

llama2 uses a decoder-only architecture. The figure below shows how llama2's decoder-only architecture differs from the classic Transformer.

llama2 removes the encoder part of the classic Transformer, so both generation and training are autoregressive, and the two attention blocks of the classic decoder are reduced to one.

Flexible input format: since a decoder-only model essentially generates output from a given text string, it can accept input in all kinds of formats, including question and answer, prompt and continuation, or code and its execution result. In other words, there is no need to specially "clean" the input dataset.

No task-specific architecture: unlike the encoder-decoder architecture, a decoder-only model does not need a dedicated encoder for each type of task, so the same model can handle many tasks with little or no modification.

Simplified pretraining and fine-tuning: without a separate encoder stage, a decoder-only model is easier to put into training. Since training mainly focuses on generating text from a given context, users need neither complex input-output encoding relations nor special handling of such mappings.

Easy to scale: thanks to its simple, uniform structure, a decoder-only model is usually easier to scale to larger sizes, which helps improve performance and adaptability. This is one of the main reasons why the many LLMs that emerged in the past year have kept pushing parameter counts to new highs.

Author: 佳人李大花
Link: https://www.zhihu.com/question/588325646/answer/3383505083

A Guided Tour of llama2.c

llama2.c program entry point

Ignoring the command-line argument handling, main() is very simple: it constructs the Transformer, Tokenizer and Sampler instances and then calls the function for the requested mode. The key one is generate(), which drives the generation task.

// build the Transformer via the model .bin file
Transformer transformer;
build_transformer(&transformer, checkpoint_path);
if (steps == 0 || steps > transformer.config.seq_len) steps = transformer.config.seq_len; // override to ~max length

// build the Tokenizer via the tokenizer .bin file
Tokenizer tokenizer;
build_tokenizer(&tokenizer, tokenizer_path, transformer.config.vocab_size);

// build the Sampler
Sampler sampler;
build_sampler(&sampler, transformer.config.vocab_size, temperature, topp, rng_seed);

// run!
if (strcmp(mode, "generate") == 0) {
    generate(&transformer, &tokenizer, &sampler, prompt, steps);
} else if (strcmp(mode, "chat") == 0) {
    chat(&transformer, &tokenizer, &sampler, prompt, system_prompt, steps);
} else {
    fprintf(stderr, "unknown mode: %s\n", mode);
    error_usage();
}

// memory and file handles cleanup
free_sampler(&sampler);
free_tokenizer(&tokenizer);
free_transformer(&transformer);

A Transformer instance holds the model's hyperparameter configuration, the weights (read-only once building is complete), the run-time state, and the metadata needed for the memory mapping.

typedef struct {
    Config config; // the hyperparameters of the architecture (the blueprint)
    TransformerWeights weights; // the weights of the model
    RunState state; // buffers for the "wave" of activations in the forward pass
    // some more state needed to properly clean up the memory mapping (sigh)
    int fd; // file descriptor for memory mapping
    float* data; // memory mapped data pointer
    ssize_t file_size; // size of the checkpoint file in bytes
} Transformer;

A Tokenizer instance holds the model's tokenizer data, including the vocabulary and the per-token scores. llama2.c implements the SentencePiece runtime algorithm for this.

A Sampler instance holds parameters such as the temperature, topp, and the RNG seed. It is used to pick the output token from the vocabulary logits produced by the final softmax.

The Transformer type

Let's start with the model hyperparameters: dim is the base vector length; hidden_dim is the hidden dimension of the feed-forward layers; n_layers is the number of layers (the Nx on the right of the figure above); n_heads is the number of query heads (how many heads the base vector is split into for QKV); n_kv_heads is the number of key/value heads (it can be smaller than n_heads, giving grouped/multi-query attention); vocab_size is the vocabulary size; seq_len is the maximum sequence length.

typedef struct {
    int dim; // transformer dimension
    int hidden_dim; // for ffn layers
    int n_layers; // number of layers
    int n_heads; // number of query heads
    int n_kv_heads; // number of key/value heads (can be < query heads because of multiquery)
    int vocab_size; // vocabulary size, usually 256 (byte-level)
    int seq_len; // max sequence length
} Config;

generate()

Once the Transformer, Tokenizer and Sampler instances are built, generate() ties the whole inference process together:

  1. Call the tokenizer to encode the (string) prompt into a token sequence
  2. Repeatedly call the transformer's forward() to get the logits
    1. While still inside the prompt, force the next prompt token (prefill stage)
    2. Once past the prompt, feed the logits to the sampler to get the next token (decode stage)
    3. Decode the token and print it
void generate(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler, char *prompt, int steps) {
    char *empty_prompt = "";
    if (prompt == NULL) { prompt = empty_prompt; }

    // encode the (string) prompt into tokens sequence
    int num_prompt_tokens = 0;
    int* prompt_tokens = (int*)malloc((strlen(prompt)+3) * sizeof(int)); // +3 for '\0', ?BOS, ?EOS
    encode(tokenizer, prompt, 1, 0, prompt_tokens, &num_prompt_tokens);
    if (num_prompt_tokens < 1) {
        fprintf(stderr, "something is wrong, expected at least 1 prompt token\n");
        exit(EXIT_FAILURE);
    }

    // start the main loop
    long start = 0;  // used to time our code, only initialized after first iteration
    int next;        // will store the next token in the sequence
    int token = prompt_tokens[0]; // kick off with the first token in the prompt
    int pos = 0;     // position in the sequence
    while (pos < steps) {

        // forward the transformer to get logits for the next token
        float* logits = forward(transformer, token, pos);

        // advance the state machine
        if (pos < num_prompt_tokens - 1) {
            // if we are still processing the input prompt, force the next prompt token
            next = prompt_tokens[pos + 1];
        } else {
            // otherwise sample the next token from the logits
            next = sample(sampler, logits);
        }
        pos++;

        // data-dependent terminating condition: the BOS (=1) token delimits sequences
        if (next == 1) { break; }

        // print the token as string, decode it with the Tokenizer object
        char* piece = decode(tokenizer, token, next);
        safe_printf(piece); // same as printf("%s", piece), but skips "unsafe" bytes
        fflush(stdout);
        token = next;

        // init the timer here because the first iteration can be slower
        if (start == 0) { start = time_in_ms(); }
    }
    printf("\n");

    // report achieved tok/s (pos-1 because the timer starts after first iteration)
    if (pos > 1) {
        long end = time_in_ms();
        fprintf(stderr, "achieved tok/s: %f\n", (pos-1) / (double)(end-start)*1000);
    }

    free(prompt_tokens);
}

forward()

forward() itself is very simple: after setting up a few convenience variables (such as the token embedding vector x), it runs the per-layer computation; each layer accumulates its result back into the dim-length vector x, which is iterated over and over. At the end, x is rmsnorm'ed once and multiplied by the classifier matrix to produce the vocabulary logits. Let's look at what each layer actually does.

float* forward(Transformer* transformer, int token, int pos) {

    // a few convenience variables
    Config* p = &transformer->config;
    TransformerWeights* w = &transformer->weights;
    RunState* s = &transformer->state;
    float *x = s->x;
    int dim = p->dim;
    int kv_dim = (p->dim * p->n_kv_heads) / p->n_heads;
    int kv_mul = p->n_heads / p->n_kv_heads; // integer multiplier of the kv sharing in multiquery
    int hidden_dim = p->hidden_dim;
    int head_size = dim / p->n_heads;

    // copy the token embedding into x
    float* content_row = w->token_embedding_table + token * dim;
    memcpy(x, content_row, dim*sizeof(*x));

    // forward all the layers
    for(unsigned long long l = 0; l < p->n_layers; l++) {
        //...
    }

    // final rmsnorm
    rmsnorm(x, x, w->rms_final_weight, dim);

    // classifier into logits
    matmul(s->logits, x, w->wcls, p->dim, p->vocab_size);
    return s->logits;
}

First rmsnorm
// forward all the layers
for(unsigned long long l = 0; l < p->n_layers; l++) {

    // attention rmsnorm
    rmsnorm(s->xb, x, w->rms_att_weight + l*dim, dim);

QKV, Attention, GQA

Q, K and V are computed by multiplying the vector x (a matrix, when there are multiple tokens) by three weight matrices. llama2 applies RoPE at this point to inject position information into each row of Q and K.

Q = \text{RoPE}(XW^Q), \quad K = \text{RoPE}(XW^K), \quad V = XW^V \\
shape(X) = (seq\_len, dim) \\
shape(W^Q) = (dim, dim), \quad shape(W^K) = shape(W^V) = (dim, kv\_dim) \\
shape(Q) = (seq\_len, dim), \quad shape(K) = shape(V) = (seq\_len, kv\_dim)

    // key and value point to the kv cache
    int loff = l * p->seq_len * kv_dim; // kv cache layer offset for convenience
    s->k = s->key_cache + loff + pos * kv_dim;
    s->v = s->value_cache + loff + pos * kv_dim;

    // qkv matmuls for this position
    matmul(s->q, s->xb, w->wq + l*dim*dim, dim, dim);
    matmul(s->k, s->xb, w->wk + l*dim*kv_dim, dim, kv_dim);
    matmul(s->v, s->xb, w->wv + l*dim*kv_dim, dim, kv_dim);

    // RoPE relative positional encoding: complex-valued rotate q and k in each head
    for (int i = 0; i < dim; i+=2) {
        int head_dim = i % head_size;
        float freq = 1.0f / powf(10000.0f, head_dim / (float)head_size);
        float val = pos * freq;
        float fcr = cosf(val);
        float fci = sinf(val);
        int rotn = i < kv_dim ? 2 : 1; // how many vectors? 2 = q & k, 1 = q only
        for (int v = 0; v < rotn; v++) {
            float* vec = v == 0 ? s->q : s->k; // the vector to rotate (query or key)
            float v0 = vec[i];
            float v1 = vec[i+1];
            vec[i]   = v0 * fcr - v1 * fci;
            vec[i+1] = v0 * fci + v1 * fcr;
        }
    }
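Per (even, odd) coordinate pair, the inner loop above is just a 2D rotation by the angle pos·f; written as a matrix (this only restates the code, it adds nothing new):

\begin{pmatrix} v_i' \\ v_{i+1}' \end{pmatrix} =
\begin{pmatrix} \cos(pos \cdot f) & -\sin(pos \cdot f) \\ \sin(pos \cdot f) & \cos(pos \cdot f) \end{pmatrix}
\begin{pmatrix} v_i \\ v_{i+1} \end{pmatrix},
\qquad f = 10000^{-\,head\_dim / head\_size}

with head_dim = i mod head_size, which is exactly the freq computed in the code.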

The attention is then computed head by head. The Mask here is the causal, lower-triangular mask: everything outside the lower triangle is zeroed out (in the code this is implicit, since scores are only ever computed for t <= pos).

llama2 uses Grouped Query Attention (GQA): after splitting into heads, several Q heads share one KV head (which is why kv_dim divides dim). The implementation is accelerated with a kvcache.

\text{Attention}_h(Q,K,V) = \text{softmax}\left(\frac{\text{Mask}(Q_h K_{kvh}^T)}{\sqrt{d_k}}\right) V_{kvh} \\
shape(Q_h) = (seq\_len, head\_size) \\
shape(K_{kvh}) = shape(V_{kvh}) = (seq\_len, head\_size), \quad kvh = \lfloor h / kv\_mul \rfloor
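A small sketch of the head-sharing index arithmetic (the head counts here are made up; llama2-7B itself has n_kv_heads equal to n_heads, i.e. ordinary multi-head attention):

#include <stdio.h>

int main(void) {
    int n_heads = 8, n_kv_heads = 2;    // hypothetical GQA configuration
    int kv_mul = n_heads / n_kv_heads;  // 4 query heads share each KV head
    for (int h = 0; h < n_heads; h++) {
        // same arithmetic as (h / kv_mul) * head_size in the attention loop below
        printf("query head %d -> kv head %d\n", h, h / kv_mul);
    }
    return 0;
}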

    // multihead attention. iterate over all heads
    int h;
    #pragma omp parallel for private(h)
    for (h = 0; h < p->n_heads; h++) {
        // get the query vector for this head
        float* q = s->q + h * head_size;
        // attention scores for this head
        float* att = s->att + h * p->seq_len;
        // iterate over all timesteps, including the current one
        for (int t = 0; t <= pos; t++) {
            // get the key vector for this head and at this timestep
            float* k = s->key_cache + loff + t * kv_dim + (h / kv_mul) * head_size;
            // calculate the attention score as the dot product of q and k
            float score = 0.0f;
            for (int i = 0; i < head_size; i++) {
                score += q[i] * k[i];
            }
            score /= sqrtf(head_size);
            // save the score to the attention buffer
            att[t] = score;
        }

        // softmax the scores to get attention weights, from 0..pos inclusively
        softmax(att, pos + 1);

        // weighted sum of the values, store back into xb
        float* xb = s->xb + h * head_size;
        memset(xb, 0, head_size * sizeof(float));
        for (int t = 0; t <= pos; t++) {
            // get the value vector for this head and at this timestep
            float* v = s->value_cache + loff + t * kv_dim + (h / kv_mul) * head_size;
            // get the attention weight for this timestep
            float a = att[t];
            // accumulate the weighted value into xb
            for (int i = 0; i < head_size; i++) {
                xb[i] += a * v[i];
            }
        }
    }

Once all heads are done, xb is the concatenation of all heads. One more fully connected layer plus a residual connection, and the attention stage is finished.

    // final matmul to get the output of the attention
    matmul(s->xb2, s->xb, w->wo + l*dim*dim, dim, dim);

    // residual connection back into x
    for (int i = 0; i < dim; i++) {
        x[i] += s->xb2[i];
    }

kvcache

Look again at the formula above:

\text{Attention}_h(Q,K,V) = \text{softmax}\left(\frac{\text{Mask}(Q_h K_{kvh}^T)}{\sqrt{d_k}}\right) V_{kvh}

It is easy to see that, because of the mask, \text{Mask}(Q_h K_{kvh}^T) is a lower-triangular matrix of shape (seq\_len, seq\_len). When the sequence grows by one token (seq\_len' = seq\_len + 1), the result only gains one new row of length seq\_len + 1, and that row comes entirely from the product of the newest row of Q_h with the whole K_{kvh} matrix. So it is enough to cache all of K during the computation (Q does not need to be cached) to still get the correct answer.

The same holds for multiplying the masked, softmaxed scores by V_{kvh}: cache the whole V_{kvh}, multiply the new score vector by V_{kvh}, and you get the new row of \text{Attention}_h. Subsequent computation just uses this new vector anyway, since the rows are accumulated afterwards.
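In formula form, at position pos only the following has to be computed per head, which is exactly what the attention loop above does while reading K_{kvh} and V_{kvh} from key_cache and value_cache:

att_{pos} = \text{softmax}\left(\frac{q_{pos} K_{kvh}^T}{\sqrt{d_k}}\right), \qquad out_{pos} = att_{pos} V_{kvh}

where q_{pos} is the query row of the current token.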

FFN

First an rmsnorm, then a two-layer network with a gated activation in between, and finally a residual connection back into x.

Note that w_1 and w_3 have shape (dim, hidden\_dim): they project up to a higher dimension, which lets the network better digest the information gathered by the attention.
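Written out, the block below computes the SwiGLU feed-forward function (this matches the PyTorch expression quoted in the code comment):

\text{FFN}(x) = W_2\left(\text{SiLU}(W_1 x) \odot (W_3 x)\right), \qquad \text{SiLU}(z) = z \cdot \sigma(z)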

    // ffn rmsnorm
    rmsnorm(s->xb, x, w->rms_ffn_weight + l*dim, dim);

    // Now for FFN in PyTorch we have: self.w2(F.silu(self.w1(x)) * self.w3(x))
    // first calculate self.w1(x) and self.w3(x)
    matmul(s->hb, s->xb, w->w1 + l*dim*hidden_dim, dim, hidden_dim);
    matmul(s->hb2, s->xb, w->w3 + l*dim*hidden_dim, dim, hidden_dim);

    // SwiGLU non-linearity
    for (int i = 0; i < hidden_dim; i++) {
        float val = s->hb[i];
        // silu(x)=x*σ(x), where σ(x) is the logistic sigmoid
        val *= (1.0f / (1.0f + expf(-val)));
        // elementwise multiply with w3(x)
        val *= s->hb2[i];
        s->hb[i] = val;
    }

    // final matmul to get the output of the ffn
    matmul(s->xb, s->hb, w->w2 + l*dim*hidden_dim, hidden_dim, dim);

    // residual connection
    for (int i = 0; i < dim; i++) {
        x[i] += s->xb[i];
    }
}

Output logits

After all layers are done, the final x is rmsnorm'ed once more and multiplied by one last matrix that maps it from dimension dim to the vocabulary dimension vocab\_size.

// final rmsnorm
rmsnorm(x, x, w->rms_final_weight, dim);

// classifier into logits
matmul(s->logits, x, w->wcls, p->dim, p->vocab_size);
return s->logits;

Summary

For llama2-7b, the model hyperparameters are:

\begin{aligned}
& dim = 4096 \\
& hidden\_dim = 11008 \\
& n\_layers = 32 \\
& n\_heads = 32 \\
& n\_kv\_heads = 32 \\
& vocab\_size = 32000 \\
& seq\_len = 4096 \\
& head\_size = dim / n\_heads = 128
\end{aligned}

dim = 4096
hidden_dim = 11008
n_layers = 32
n_heads = 32
n_kv_heads = 32
vocab_size = 32000
seq_len = 4096
head_size = dim // n_heads
w = {
    # token embedding table
    "token_embedding_table": vocab_size * dim,
    # RMS-norm weights (for attention and FFN)
    "rms_att_weight": n_layers * dim,
    "rms_ffn_weight": n_layers * dim,
    # matmul weights
    "wq": n_layers * dim * (n_heads * head_size),     # Q projection
    "wk": n_layers * dim * (n_kv_heads * head_size),  # K projection
    "wv": n_layers * dim * (n_kv_heads * head_size),  # V projection
    "wo": n_layers * (n_heads * head_size) * dim,     # output projection
    # feed-forward network weights
    "w1": n_layers * dim * hidden_dim,  # first linear
    "w2": n_layers * hidden_dim * dim,  # second linear
    "w3": n_layers * dim * hidden_dim,  # third linear (gate)
    # final RMS-norm weight (before the output layer)
    "rms_final_weight": dim,
    # classifier weights (final logits output)
    "wcls": dim * vocab_size,
}
print(w)
print("%.1e" % sum(w.values()))
{
    token_embedding_table: "1.3e+08",
    rms_att_weight: "1.3e+05",
    rms_ffn_weight: "1.3e+05",
    wq: "5.4e+08",
    wk: "5.4e+08",
    wv: "5.4e+08",
    wo: "5.4e+08",
    w1: "1.4e+09",
    w2: "1.4e+09",
    w3: "1.4e+09",
    rms_final_weight: "4.1e+03",
    wcls: "1.3e+08",
}
6.7e+09

Total parameter count: 6.7B.

Performance

Reference

  1. Attention Is All You Need
  2. Llama 2: Open Foundation and Fine-Tuned Chat Models
  3. [KV Cache优化]🔥MQA/GQA/YOCO/CLA/MLKV笔记: 层内和层间KV Cache共享-DefTruth
  4. Understanding Llama2: KV Cache, Grouped Query Attention, Rotary Embedding and More-CheeKean
  5. karpathy/llama2.c