# Model Conversion Example
This directory contains scripts and code to help in the process of converting
HuggingFace PyTorch models to GGUF format.

The motivation for having this is that the conversion process can often be an
iterative process, where the original model is inspected, converted, updates
made to llama.cpp, converted again, etc. Once the model has been converted it
needs to be verified against the original model, then optionally quantized,
and in some cases the perplexity of the quantized model checked. Finally, the
model/models need to be uploaded to ggml-org on Hugging Face. This tool/example
tries to help with this process.

> 📝 **Note:** When adding a new model from an existing family, verify that the
> previous version passes logits verification first. Existing models can have
> subtle numerical differences that don't affect generation quality but cause
> logits mismatches. Identifying upfront whether these exist in llama.cpp,
> the conversion script, or an upstream implementation can save significant
> debugging time.
## Overview
The idea is that the makefile targets and scripts here can be used in the
development/conversion process, assisting with things like:

* inspect/run the original model to figure out how it works
* convert the original model to GGUF format
* inspect/run the converted model
* verify the logits produced by the original model and the converted model
* quantize the model to GGUF format
* run perplexity evaluation to verify that the quantized model is performing
  as expected
* upload the model to HuggingFace to make it available for others
## Setup
Create a virtual Python environment:
```console
$ python3.11 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
```

## Causal Language Model Conversion
This section describes the steps to convert a causal language model to GGUF and
to verify that the conversion was successful.

### Download the original model
First, clone the original model to some local directory:
```console
$ mkdir models && cd models
$ git clone https://huggingface.co/user/model_name
$ cd model_name
$ git lfs install
$ git lfs pull
```

### Set the MODEL_PATH
The path to the downloaded model can be provided in two ways:

**Option 1: Environment variable (recommended for iterative development)**
```console
export MODEL_PATH=~/work/ai/models/some_model
```

**Option 2: Command line argument (for one-off tasks)**
```console
make causal-convert-model MODEL_PATH=~/work/ai/models/some_model
```

Command line arguments take precedence over environment variables when both are provided.
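For example, with `MODEL_PATH` exported for the session, a one-off run against a
different model can still be made by passing the variable on the command line
(the paths below are illustrative):
```console
(venv) $ export MODEL_PATH=~/work/ai/models/some_model
# The command line value overrides the exported one for this invocation only:
(venv) $ make causal-convert-model MODEL_PATH=~/work/ai/models/another_model
```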

In cases where the transformers implementation for the model has not been
released yet, it is possible to set the environment variable
`UNRELEASED_MODEL_NAME`, which will cause the model implementation to be loaded
explicitly instead of using AutoModelForCausalLM:
```console
export UNRELEASED_MODEL_NAME=SomeNewModel
```

### Inspecting the original tensors
```console
# Using environment variable
(venv) $ make causal-inspect-original-model

# Or using command line argument
(venv) $ make causal-inspect-original-model MODEL_PATH=~/work/ai/models/some_model
```

### Running the original model
This is mainly to verify that the original model works, and to compare its
output with the output from the converted model.
```console
# Using environment variable
(venv) $ make causal-run-original-model

# Or using command line argument
(venv) $ make causal-run-original-model MODEL_PATH=~/work/ai/models/some_model
```
This command will save two files to the `data` directory: a binary file
containing logits, which will be used for comparison with the converted model
later, and a text file which allows for manual visual inspection.
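The exact file names depend on the model; for a model named `some_model`
(a hypothetical name for illustration), the `data` directory would contain
something like:
```console
# hypothetical file names, shown for illustration only
(venv) $ ls data/
some_model.bin  some_model.txt
```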

### Model conversion
After updates have been made to [gguf-py](../../gguf-py) to add support for the
new model, the model can be converted to GGUF format using the following command:
```console
# Using environment variable
(venv) $ make causal-convert-model

# Or using command line argument
(venv) $ make causal-convert-model MODEL_PATH=~/work/ai/models/some_model
```

### Inspecting the converted model
The converted model can be inspected using the following command:
```console
(venv) $ make causal-inspect-converted-model
```

### Running the converted model
```console
(venv) $ make causal-run-converted-model
```

### Model logits verification
The following target will run the original model and the converted model and
compare the logits:
```console
(venv) $ make causal-verify-logits
```

### Quantizing the model
The causal model can be quantized to GGUF format using the following command:
```console
(venv) $ make causal-quantize-Q8_0
Quantized model saved to: /path/to/quantized/model-Q8_0.gguf
Export the quantized model path to QUANTIZED_MODEL variable in your environment
```
This will show the path to the quantized model in the terminal, which can then
be used to set the `QUANTIZED_MODEL` environment variable:
```console
export QUANTIZED_MODEL=/path/to/quantized/model-Q8_0.gguf
```
Then the quantized model can be run using the following command:
```console
(venv) $ make causal-run-quantized-model
```

### Quantizing QAT (Quantization Aware Training) models
When quantizing to `Q4_0`, the default data type for the token embedding weights
will be `Q6_K`. For models that are going to be uploaded to ggml-org it is
recommended to use `Q8_0` instead for the embedding and output tensors.
The reason is that although `Q6_K` is smaller in size, it requires more compute
to unpack, which can hurt performance during output generation when the entire
embedding matrix must be dequantized to compute vocabulary logits. `Q8_0`
provides practically full quality with better computational efficiency.
```console
(venv) $ make causal-quantize-qat-Q4_0
```
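As a rough sketch of what this target does (an assumption about its behavior,
with illustrative paths), the equivalent direct invocation of `llama-quantize`
would override the two tensor types like this:
```console
$ ./build/bin/llama-quantize \
    --token-embedding-type q8_0 \
    --output-tensor-type q8_0 \
    model.gguf model-QAT-Q4_0.gguf Q4_0
```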


## Embedding Language Model Conversion

### Download the original model
```console
$ mkdir models && cd models
$ git clone https://huggingface.co/user/model_name
$ cd model_name
$ git lfs install
$ git lfs pull
```

The path to the embedding model can be provided in two ways:

**Option 1: Environment variable (recommended for iterative development)**
```console
export EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```

**Option 2: Command line argument (for one-off tasks)**
```console
make embedding-convert-model EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```

Command line arguments take precedence over environment variables when both are provided.

### Running the original model
This is mainly to verify that the original model works and to compare its output
with the output from the converted model.
```console
# Using environment variable
(venv) $ make embedding-run-original-model

# Or using command line argument
(venv) $ make embedding-run-original-model EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```
This command will save two files to the `data` directory: a binary file
containing logits, which will be used for comparison with the converted model,
and a text file which allows for manual visual inspection.

#### Using SentenceTransformer with numbered layers
For models that have numbered SentenceTransformer layers (01_Pooling, 02_Dense,
03_Dense, 04_Normalize), these will be applied automatically when running the
converted model, but currently there is a separate target to run the original
version:

```console
# Run original model with SentenceTransformer (applies all numbered layers)
(venv) $ make embedding-run-original-model-st
```

This will use the SentenceTransformer library to load and run the model, which
automatically applies all the numbered layers in the correct order. This is
particularly useful when comparing with models that should include these
additional transformation layers beyond just the base model output.
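For reference, such a model repository typically contains one directory per
numbered layer, alongside a `modules.json` file that lists the order in which
the layers are applied; a hypothetical layout for illustration:
```console
$ ls ~/path/to/embedding_model
01_Pooling  02_Dense  03_Dense  04_Normalize  config.json  model.safetensors  modules.json
```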

The type of normalization can be specified for the converted model, but this is
not strictly necessary as the verification uses cosine similarity, which is
unaffected by the magnitude of the output vectors. The normalization type can
still be specified as an argument to the target, which might be useful for
manual inspection:
```console
(venv) $ make embedding-verify-logits-st EMBD_NORMALIZE=1
```
The original model will apply the normalization according to the normalization
layer specified in the `modules.json` configuration file.

### Model conversion
After updates have been made to [gguf-py](../../gguf-py) to add support for the
new model, the model can be converted to GGUF format using the following command:
```console
(venv) $ make embedding-convert-model
```

### Run the converted model
```console
(venv) $ make embedding-run-converted-model
```

### Model logits verification
The following target will run the original model and the converted model (which
was done manually in the previous steps) and compare the logits:
```console
(venv) $ make embedding-verify-logits
```

For models with SentenceTransformer layers, use the `-st` verification target:
```console
(venv) $ make embedding-verify-logits-st
```
This convenience target automatically runs both the original model with SentenceTransformer
and the converted model with pooling enabled, then compares the results.

### llama-server verification
To verify that the converted model works with llama-server, the following
command can be used:
```console
(venv) $ make embedding-start-embedding-server
```
Then open another terminal and set the `EMBEDDING_MODEL_PATH` environment
variable, as it will not be inherited by the new terminal:
```console
(venv) $ make embedding-curl-embedding-endpoint
```
This will call the `embedding` endpoint and the output will be piped into
the same verification script used by the `embedding-verify-logits` target.
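The endpoint can also be exercised manually with curl; a minimal sketch,
assuming llama-server is listening on its default port 8080:
```console
$ curl -s http://localhost:8080/embedding \
    -H "Content-Type: application/json" \
    -d '{"content": "Hello, world"}'
```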

The causal model can also be used to produce embeddings, and this can be verified
using the following commands:
```console
(venv) $ make causal-start-embedding-server
```
Then open another terminal and set the `MODEL_PATH` environment
variable, as it will not be inherited by the new terminal:
```console
(venv) $ make causal-curl-embedding-endpoint
```

### Quantizing the model
The embedding model can be quantized to GGUF format using the following command:
```console
(venv) $ make embedding-quantize-Q8_0
Quantized model saved to: /path/to/quantized/model-Q8_0.gguf
Export the quantized model path to QUANTIZED_EMBEDDING_MODEL variable in your environment
```
This will show the path to the quantized model in the terminal, which can then
be used to set the `QUANTIZED_EMBEDDING_MODEL` environment variable:
```console
export QUANTIZED_EMBEDDING_MODEL=/path/to/quantized/model-Q8_0.gguf
```
Then the quantized model can be run using the following command:
```console
(venv) $ make embedding-run-quantized-model
```

### Quantizing QAT (Quantization Aware Training) models
As with the causal model, when quantizing to `Q4_0` the default data type for
the token embedding weights will be `Q6_K`, and for models that are going to be
uploaded to ggml-org it is recommended to use `Q8_0` instead for the embedding
and output tensors (see the rationale in the causal section above):
```console
(venv) $ make embedding-quantize-qat-Q4_0
```

## Perplexity Evaluation

### Simple perplexity evaluation
This allows running the perplexity evaluation without having to generate a
token/logits file first:
```console
(venv) $ make perplexity-run QUANTIZED_MODEL=~/path/to/quantized/model.gguf
```
This will use the wikitext dataset to run the perplexity evaluation and
output the perplexity score to the terminal. This value can then be compared
with the perplexity score of the unquantized model.

### Full perplexity evaluation
First use the converted, non-quantized, model to generate the perplexity evaluation
dataset using the following command:
```console
$ make perplexity-data-gen CONVERTED_MODEL=~/path/to/converted/model.gguf
```
This will generate a file in the `data` directory named after the model and with
a `.kld` suffix which contains the tokens and the logits for the wikitext dataset.

After the dataset has been generated, the perplexity evaluation can be run using
the quantized model:
```console
$ make perplexity-run-full QUANTIZED_MODEL=~/path/to/quantized/model-Qxx.gguf LOGITS_FILE=data/model.gguf.kld
```

> 📝 **Note:** The `LOGITS_FILE` generated by the previous command can be very
> large, so make sure you have enough disk space available.

## HuggingFace utilities
The following targets are useful for creating collections and model repositories
on Hugging Face in the ggml-org organization. These can be used when preparing a
release to script the process for new model releases.

For the following targets a `HF_TOKEN` environment variable is required.
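For example (the token value is a placeholder):
```console
(venv) $ export HF_TOKEN=<your-hugging-face-access-token>
```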

> 📝 **Note:** Don't forget to log out from Hugging Face after running these
> commands, otherwise you might have issues pulling/cloning repositories as
> the token will still be in use:
> ```console
> $ huggingface-cli logout
> $ unset HF_TOKEN
> ```

### Create a new Hugging Face Model (model repository)
This will create a new model repository on Hugging Face with the specified
model name.
```console
(venv) $ make hf-create-model MODEL_NAME='TestModel' NAMESPACE="danbev" ORIGINAL_BASE_MODEL="some-base-model"
Repository ID:  danbev/TestModel-GGUF
Repository created: https://huggingface.co/danbev/TestModel-GGUF
```
Note that we append a `-GGUF` suffix to the model name to ensure a consistent
naming convention for GGUF models.

An embedding model can be created using the following command:
```console
(venv) $ make hf-create-model-embedding MODEL_NAME='TestEmbeddingModel' NAMESPACE="danbev" ORIGINAL_BASE_MODEL="some-base-model"
```
The only difference is that the model card for an embedding model will differ
with regard to the llama-server command and how to access/call the embedding
endpoint.

### Upload a GGUF model to model repository
The following target uploads a model to an existing Hugging Face model repository.
```console
(venv) $ make hf-upload-gguf-to-model MODEL_PATH=dummy-model1.gguf REPO_ID=danbev/TestModel-GGUF
📤 Uploading dummy-model1.gguf to danbev/TestModel-GGUF/dummy-model1.gguf
✅ Upload successful!
🔗 File available at: https://huggingface.co/danbev/TestModel-GGUF/blob/main/dummy-model1.gguf
```
This command can also be used to update an existing model file in a repository.

### Create a new Collection
```console
(venv) $ make hf-new-collection NAME=TestCollection DESCRIPTION="Collection for testing scripts" NAMESPACE=danbev
🚀 Creating Hugging Face Collection
Title: TestCollection
Description: Collection for testing scripts
Namespace: danbev
Private: False
✅ Authenticated as: danbev
📚 Creating collection: 'TestCollection'...
✅ Collection created successfully!
📋 Collection slug: danbev/testcollection-68930fcf73eb3fc200b9956d
🔗 Collection URL: https://huggingface.co/collections/danbev/testcollection-68930fcf73eb3fc200b9956d

🎉 Collection created successfully!
Use this slug to add models: danbev/testcollection-68930fcf73eb3fc200b9956d
```

### Add model to a Collection
```console
(venv) $ make hf-add-model-to-collection COLLECTION=danbev/testcollection-68930fcf73eb3fc200b9956d MODEL=danbev/TestModel-GGUF
✅ Authenticated as: danbev
🔍 Checking if model exists: danbev/TestModel-GGUF
✅ Model found: danbev/TestModel-GGUF
📚 Adding model to collection...
✅ Model added to collection successfully!
🔗 Collection URL: https://huggingface.co/collections/danbev/testcollection-68930fcf73eb3fc200b9956d

🎉 Model added successfully!
```