# quantize

This tool takes a GGUF input model file, typically in a high-precision format such as F32 or BF16, and converts it to a quantized format.
Quantization reduces the precision of the model weights (e.g., from 32-bit floats to 4-bit integers), which shrinks the model's size and can speed up inference.
This process, however, may introduce some accuracy loss, which is usually measured in [Perplexity](https://huggingface.co/docs/transformers/en/perplexity) (ppl) and/or [Kullback–Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) (kld).
The loss can be minimized by using a suitable importance matrix (imatrix) file.

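If you do not already have an importance matrix, one can be generated with the `llama-imatrix` tool from a representative calibration text. A minimal sketch (the model and calibration file names here are placeholders):

```bash
# generate an importance matrix from a calibration text file (file names are examples)
./llama-imatrix -m ./models/mymodel/ggml-model-f16.gguf -f calibration-data.txt -o imatrix.gguf
```
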
You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup.

Note: the space is synced from the llama.cpp `main` branch every 6 hours.

Example usage:

```./llama-quantize [options] input-model-f32.gguf [output-model-quant.gguf] type [threads]```

```bash
# from Hugging Face, obtain the official meta-llama/Llama-3.1-8B model weights and place them in ./models
ls ./models
config.json model-00001-of-00004.safetensors model-00004-of-00004.safetensors README.md tokenizer.json
generation_config.json model-00002-of-00004.safetensors model.safetensors.index.json special_tokens_map.json USE_POLICY.md
LICENSE model-00003-of-00004.safetensors original tokenizer_config.json

# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to GGML FP16 format
python3 convert_hf_to_gguf.py ./models/mymodel/

# quantize the model to 4 bits (using the Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the GGUF filetype to the current version if the older version is no longer supported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```
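
`convert_hf_to_gguf.py` also accepts an `--outtype` flag if you want a different starting precision, e.g. BF16. A brief sketch using the same hypothetical model directory:

```bash
# convert to BF16 instead of the default F16
python3 convert_hf_to_gguf.py ./models/mymodel/ --outtype bf16
```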

Run the quantized model:

```bash
# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"
```
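
To gauge how much accuracy was lost, perplexity can be compared between the original and quantized models with `llama-perplexity`. A minimal sketch, assuming a test corpus such as `wiki.test.raw` is available:

```bash
# measure perplexity of the quantized model on a test set (file name is an example)
./llama-perplexity -m ./models/mymodel/ggml-model-Q4_K_M.gguf -f wiki.test.raw
```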

Options:
* `--allow-requantize` allows requantizing tensors that have already been quantized. Warning: this can severely reduce quality compared to quantizing from 16-bit or 32-bit
* `--leave-output-tensor` will leave the `output.weight` tensor un(re)quantized. Increases model size but may also increase quality, especially when requantizing
* `--pure` disables k-quant mixtures and quantizes all tensors to the same type
* `--imatrix` uses the data in the given file, generated by `llama-imatrix`, as an importance matrix for quantization optimizations (highly recommended)
* `--include-weights` use the importance matrix only for the tensor(s) in the list. Cannot be used with `--exclude-weights`
* `--exclude-weights` do not use the importance matrix for the tensor(s) in the list. Cannot be used with `--include-weights`
* `--output-tensor-type` use a specific quant type for the `output.weight` tensor
* `--token-embedding-type` use a specific quant type for the token embeddings tensor
* `--keep-split` will generate the quantized model in the same shards as the input file; otherwise a single quantized file is produced

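Running `llama-quantize` with no arguments prints the full usage text, including the list of supported quantization types:

```bash
# print usage, options, and the list of available quantization types
./llama-quantize
```
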
Advanced options:
* `--tensor-type` quantize specific tensor(s) to specific quant types. Supports regex syntax. May be specified multiple times.
* `--prune-layers` prune (remove) the layers in the list
* `--override-kv` option to override model metadata by key in the quantized model. May be specified multiple times

Examples:

```bash
# naive Q4_K_M quantization using default settings and 8 CPU threads. Output will be "ggml-model-Q4_K_M.gguf"
./llama-quantize input-model-f32.gguf q4_k_m 8
```

```bash
# quantize model enabling re-quantization, leaving the output tensor unquantized and all others quantized at the same level (Q4_K)
./llama-quantize --allow-requantize --leave-output-tensor --pure input-model-f32.gguf q4_k_m 8
```

```bash
# quantize model using an importance matrix for specified tensors only (attn_v and ffn_down)
./llama-quantize --imatrix imatrix.gguf --include-weights attn_v --include-weights ffn_down input-model-f32.gguf q4_k_m 8
```

```bash
# quantize model setting the output tensor type to Q5_K, the token embeddings type to Q3_K, and keeping the input file's shards
./llama-quantize --imatrix imatrix.gguf --output-tensor-type q5_k --token-embedding-type q3_k --keep-split input-model-f32.gguf q4_k_m 8
```

```bash
# quantize model using a regex to quantize attn_k tensors in odd layers to Q5_K and attn_q tensors in even layers to Q3_K
./llama-quantize --imatrix imatrix.gguf --tensor-type "\.(\d*[13579])\.attn_k=q5_k" --tensor-type "\.(\d*[02468])\.attn_q=q3_k" input-model-f32.gguf q4_k_m 8
```

```bash
# quantize model setting tensors attn_v and ffn_down to Q5_K and pruning layers 20, 21, and 22
./llama-quantize --imatrix imatrix.gguf --tensor-type attn_v=q5_k --tensor-type ffn_down=q5_k --prune-layers 20,21,22 input-model-f32.gguf q4_k_m 8
```

```bash
# override the expert used count metadata to 16, prune layers 20, 21, and 22 without quantizing the model (copy tensors), and use the specified name for the output file
./llama-quantize --imatrix imatrix.gguf --override-kv qwen3moe.expert_used_count=int:16 --prune-layers 20,21,22 input-model-f32.gguf pruned-model-f32.gguf copy 8
```

## Memory/Disk Requirements

When running the larger models, make sure you have enough disk space to store all the intermediate files.
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same. For example (Llama 3.1):

| Model | Original size | Quantized size (Q4_K_M) |
| ----: | ------------: | ----------------------: |
| 8B | 32.1 GB | 4.9 GB |
| 70B | 280.9 GB | 43.1 GB |
| 405B | 1,625.1 GB | 249.1 GB |

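As a back-of-the-envelope check, the quantized file size follows from the parameter count and the bits-per-weight figures listed in the tables below (the ~8.03B parameter count used here for Llama 3.1 8B is an assumption for illustration only):

```bash
# ~8.03B parameters * 4.8944 bits/weight (Q4_K_M) / 8 bits per byte / 1024^3 bytes per GiB ≈ 4.58 GiB
echo "8030000000 * 4.8944 / 8 / (1024^3)" | bc -l
```
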
## Quantization

Several quantization methods are supported. They differ in the resulting model disk size and inference speed. For example:

### [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)

| Measure | IQ1_S | IQ1_M | IQ2_XXS | IQ2_XS | IQ2_S | IQ2_M |
| --------------------------- | ------------ | ------------ | ------------ | ------------- | ------------- | ------------ |
| bits/weight | 2.0042 | 2.1460 | 2.3824 | 2.5882 | 2.7403 | 2.9294 |
| size (GiB) | 1.87 | 2.01 | 2.23 | 2.42 | 2.56 | 2.74 |
| prompt processing t/s @ 512 | 858.88 ±1.22 | 847.99 ±0.47 | 852.39 ±0.85 | 826.99 ±12.51 | 783.55 ±13.73 | 787.68 ±7.00 |
| text generation t/s @ 128 | 79.73 ±0.79 | 72.92 ±0.14 | 79.86 ±0.22 | 78.04 ±0.46 | 77.30 ±2.47 | 74.44 ±0.15 |

| Measure | IQ3_XXS | IQ3_XS | IQ3_S | IQ3_M | IQ4_XS | IQ4_NL |
| --------------------------- | ------------ | ------------ | ------------ | ------------- | ------------- | ------------ |
| bits/weight | 3.2548 | 3.4977 | 3.6606 | 3.7628 | 4.4597 | 4.6818 |
| size (GiB) | 3.04 | 3.27 | 3.42 | 3.52 | 4.17 | 4.38 |
| prompt processing t/s @ 512 | 813.88 ±6.53 | 708.71 ±1.26 | 798.78 ±8.81 | 768.70 ±13.73 | 771.80 ±11.38 | 806.03 ±7.07 |
| text generation t/s @ 128 | 73.95 ±0.20 | 71.67 ±0.54 | 69.31 ±0.63 | 70.15 ±0.33 | 77.51 ±0.20 | 76.63 ±0.28 |

| Measure | Q2_K_S | Q2_K | Q3_K_S | Q3_K_M | Q3_K_L | Q4_K_S |
| --------------------------- | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| bits/weight | 2.9697 | 3.1593 | 3.6429 | 3.9960 | 4.2979 | 4.6672 |
| size (GiB) | 2.78 | 2.95 | 3.41 | 3.74 | 4.02 | 4.36 |
| prompt processing t/s @ 512 | 798.91 ±6.40 | 784.45 ±7.85 | 752.17 ±7.94 | 783.44 ±9.92 | 761.17 ±7.55 | 818.55 ±9.58 |
| text generation t/s @ 128 | 90.01 ±0.12 | 79.85 ±0.20 | 69.84 ±0.18 | 71.68 ±0.22 | 69.38 ±0.49 | 76.71 ±0.20 |

| Measure | Q4_K_S | Q4_K_M | Q5_K_S | Q5_K_M | Q6_K | Q8_0 |
| --------------------------- | ------------ | ------------- | ------------ | ------------ | ------------- | ------------ |
| bits/weight | 4.6672 | 4.8944 | 5.5704 | 5.7036 | 6.5633 | 8.5008 |
| size (GiB) | 4.36 | 4.58 | 5.21 | 5.33 | 6.14 | 7.95 |
| prompt processing t/s @ 512 | 818.55 ±9.58 | 821.81 ±21.44 | 752.52 ±0.99 | 758.69 ±7.43 | 812.01 ±10.82 | 865.09 ±8.30 |
| text generation t/s @ 128 | 76.71 ±0.20 | 71.93 ±1.52 | 69.53 ±0.18 | 67.23 ±1.08 | 58.67 ±3.13 | 50.93 ±0.08 |

| Measure | F16 |
| --------------------------- | ------------ |
| bits/weight | 16.0005 |
| size (GiB) | 14.96 |
| prompt processing t/s @ 512 | 923.49 ±0.53 |
| text generation t/s @ 128 | 29.17 ±0.04 |

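Throughput figures such as the prompt-processing and text-generation rates above can be measured with `llama-bench`. A minimal sketch (512-token prompt processing and 128-token generation match the columns above; the model path is an example):

```bash
# benchmark prompt processing (512 tokens) and text generation (128 tokens) on a quantized model
./llama-bench -m ./models/mymodel/ggml-model-Q4_K_M.gguf -p 512 -n 128
```
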
## Background information on llama-quantize

- [k-quants](https://github.com/ggml-org/llama.cpp/pull/1684)
- k-quants improvements and i-quants
  - [#2707](https://github.com/ggml-org/llama.cpp/pull/2707)
  - [#2807](https://github.com/ggml-org/llama.cpp/pull/2807)
  - [#4773 - 2-bit i-quants (inference)](https://github.com/ggml-org/llama.cpp/pull/4773)
  - [#4856 - 2-bit i-quants (inference)](https://github.com/ggml-org/llama.cpp/pull/4856)
  - [#4861 - importance matrix](https://github.com/ggml-org/llama.cpp/pull/4861)
  - [#4872 - MoE models](https://github.com/ggml-org/llama.cpp/pull/4872)
  - [#4897 - 2-bit quantization](https://github.com/ggml-org/llama.cpp/pull/4897)
  - [#4930 - imatrix for all k-quants](https://github.com/ggml-org/llama.cpp/pull/4930)
  - [#4951 - imatrix on the GPU](https://github.com/ggml-org/llama.cpp/pull/4957)
  - [#4969 - imatrix for legacy quants](https://github.com/ggml-org/llama.cpp/pull/4969)
  - [#4996 - k-quants tuning](https://github.com/ggml-org/llama.cpp/pull/4996)
  - [#5060 - Q3_K_XS](https://github.com/ggml-org/llama.cpp/pull/5060)
  - [#5196 - 3-bit i-quants](https://github.com/ggml-org/llama.cpp/pull/5196)
  - [quantization tuning](https://github.com/ggml-org/llama.cpp/pull/5320), [another one](https://github.com/ggml-org/llama.cpp/pull/5334), and [another one](https://github.com/ggml-org/llama.cpp/pull/5361)