# Granite Vision

Download the model and point your `GRANITE_MODEL` environment variable to the path.

```bash
$ git clone https://huggingface.co/ibm-granite/granite-vision-3.2-2b
$ export GRANITE_MODEL=./granite-vision-3.2-2b
```


### 1. Running llava surgery v2
First, we need to run the llava surgery script as shown below:

`python llava_surgery_v2.py -C -m $GRANITE_MODEL`

You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below.

```bash
$ ls $GRANITE_MODEL | grep -i llava
llava.clip
llava.projector
```

The projector and visual encoder should now be split out into the llava files. A quick check to make sure they aren't empty:
```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))

assert len(encoder_tensors) > 0
assert len(projector_tensors) > 0
```

If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
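A minimal sketch of that inspection, which simply reloads the two files from the check above and prints their keys:

```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))

# The encoder side should be dominated by vision_model.* entries.
vision_keys = [k for k in encoder_tensors if "vision_model" in k]
print(f"{len(vision_keys)} of {len(encoder_tensors)} encoder tensors belong to the vision model")

# The projector side should contain exactly the five keys listed above.
print(sorted(projector_tensors.keys()))
```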


### 2. Creating the Visual Component GGUF
Next, create a new directory to hold the visual components, and copy the llava.clip/projector files, as shown below.

```bash
$ ENCODER_PATH=$PWD/visual_encoder
$ mkdir $ENCODER_PATH

$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
```

Now, we need to write a config for the visual encoder. In order to convert the model, be sure to use the correct `image_grid_pinpoints`, as these may vary based on the model. You can find the `image_grid_pinpoints` in `$GRANITE_MODEL/config.json`.

```json
{
    "_name_or_path": "siglip-model",
    "architectures": [
        "SiglipVisionModel"
    ],
    "image_grid_pinpoints": [
        [384,384],
        [384,768],
        [384,1152],
        [384,1536],
        [384,1920],
        [384,2304],
        [384,2688],
        [384,3072],
        [384,3456],
        [384,3840],
        [768,384],
        [768,768],
        [768,1152],
        [768,1536],
        [768,1920],
        [1152,384],
        [1152,768],
        [1152,1152],
        [1536,384],
        [1536,768],
        [1920,384],
        [1920,768],
        [2304,384],
        [2688,384],
        [3072,384],
        [3456,384],
        [3840,384]
    ],
    "mm_patch_merge_type": "spatial_unpad",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "layer_norm_eps": 1e-6,
    "hidden_act": "gelu_pytorch_tanh",
    "projection_dim": 0,
    "vision_feature_layer": [-24, -20, -12, -1]
}
```
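
Rather than copying the pinpoints above blindly, you can print the ones shipped with your copy of the model; a minimal sketch, assuming `image_grid_pinpoints` is a top-level key in the model's `config.json` (which is where this checkpoint keeps it):

```python
import json
import os

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

with open(os.path.join(MODEL_PATH, "config.json")) as f:
    cfg = json.load(f)

# Print the grid pinpoints so they can be compared against the encoder config above.
print(json.dumps(cfg.get("image_grid_pinpoints"), indent=2))
```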

At this point you should have something like this:
```bash
$ ls $ENCODER_PATH
config.json llava.projector pytorch_model.bin
```

Now convert the components to GGUF. Note that we also override the image mean/std dev to `[0.5, 0.5, 0.5]` since we use the SigLIP visual encoder; in the transformers model, you can find these numbers in `preprocessor_config.json`.
```bash
$ python convert_image_encoder_to_gguf.py \
    -m $ENCODER_PATH \
    --llava-projector $ENCODER_PATH/llava.projector \
    --output-dir $ENCODER_PATH \
    --clip-model-is-vision \
    --clip-model-is-siglip \
    --image-mean 0.5 0.5 0.5 \
    --image-std 0.5 0.5 0.5
```
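
If you want to double-check those normalization values against the ones shipped with the model, a quick sketch (assuming the processor config uses the usual `image_mean` / `image_std` keys):

```python
import json
import os

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

with open(os.path.join(MODEL_PATH, "preprocessor_config.json")) as f:
    pre = json.load(f)

# Both should print [0.5, 0.5, 0.5] for the SigLIP encoder used here.
print(pre.get("image_mean"), pre.get("image_std"))
```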

The conversion will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`, e.g. `export VISUAL_GGUF_PATH=$ENCODER_PATH/mmproj-model-f16.gguf`.


### 3. Creating the LLM GGUF
The granite vision model contains a granite LLM as its language model. For now, the easiest way to get a GGUF for the LLM is to load the composite model in `transformers` and export the LLM so that it can be converted directly with the normal conversion path.

First, set `LLM_EXPORT_PATH` to the path the `transformers` LLM should be exported to.
```bash
$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
```

```python
import os
import transformers

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)

# NOTE: granite vision support was added in transformers 4.49;
# if you get size mismatches, your version is too old.
# If you are running with an older version, set `ignore_mismatched_sizes=True`
# as shown below; the composite model won't be loaded correctly, but the LLM part
# that we are exporting will be.
model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)

tokenizer.save_pretrained(LLM_EXPORT_PATH)
model.language_model.save_pretrained(LLM_EXPORT_PATH)
```
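
Before converting, it is worth a quick check that both the tokenizer and the language model actually landed in the export directory; a minimal sketch:

```python
import os

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")

# Expect tokenizer files plus the language model's config and weight shards here.
print(sorted(os.listdir(LLM_EXPORT_PATH)))
```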

Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama.cpp project.
```bash
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
...
$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
```


### 4. Quantization
If you want to quantize the LLM, you can do so with `llama-quantize` as you would any other LLM. For example:
```bash
$ ./build/bin/llama-quantize $LLM_EXPORT_PATH/granite_llm.gguf $LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf Q4_K_M
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf
```

Note that currently you cannot quantize the visual encoder because granite vision models use SigLIP as the visual encoder, which has tensor dimensions that are not divisible by 32.
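
To make that last point concrete, a tiny sketch of the arithmetic, using the dimensions from the SigLIP config in step 2 and taking the block size of 32 from the note above:

```python
# hidden_size divides evenly into 32-wide blocks, but intermediate_size does not,
# which is enough to rule out block quantization of the encoder.
for name, dim in [("hidden_size", 1152), ("intermediate_size", 4304)]:
    print(f"{name}: {dim} % 32 = {dim % 32}")
```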


### 5. Running the Model in llama.cpp
Build llama.cpp normally; you should have a target binary named `llama-mtmd-cli`, to which you pass the two GGUF files built above. As an example, we pass the llama.cpp banner.

```bash
$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    -c 16384 \
    --temp 0
```