diff options
Diffstat (limited to 'llama.cpp/docs/multimodal/granitevision.md')
| -rw-r--r-- | llama.cpp/docs/multimodal/granitevision.md | 186 |
1 files changed, 186 insertions, 0 deletions
diff --git a/llama.cpp/docs/multimodal/granitevision.md b/llama.cpp/docs/multimodal/granitevision.md new file mode 100644 index 0000000..3118fe0 --- /dev/null +++ b/llama.cpp/docs/multimodal/granitevision.md @@ -0,0 +1,186 @@ +# Granite Vision + +Download the model and point your `GRANITE_MODEL` environment variable to the path. + +```bash +$ git clone https://huggingface.co/ibm-granite/granite-vision-3.2-2b +$ export GRANITE_MODEL=./granite-vision-3.2-2b +``` + + +### 1. Running llava surgery v2. +First, we need to run the llava surgery script as shown below: + +`python llava_surgery_v2.py -C -m $GRANITE_MODEL` + +You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below. + +```bash +$ ls $GRANITE_MODEL | grep -i llava +llava.clip +llava.projector +``` + +We should see that the projector and visual encoder get split out into the llava files. Quick check to make sure they aren't empty: +```python +import os +import torch + +MODEL_PATH = os.getenv("GRANITE_MODEL") +if not MODEL_PATH: + raise ValueError("env var GRANITE_MODEL is unset!") + +encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip")) +projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector")) + +assert len(encoder_tensors) > 0 +assert len(projector_tensors) > 0 +``` + +If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`. + + +### 2. Creating the Visual Component GGUF +Next, create a new directory to hold the visual components, and copy the llava.clip/projector files, as shown below. + +```bash +$ ENCODER_PATH=$PWD/visual_encoder +$ mkdir $ENCODER_PATH + +$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin +$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/ +``` + +Now, we need to write a config for the visual encoder. In order to convert the model, be sure to use the correct `image_grid_pinpoints`, as these may vary based on the model. You can find the `image_grid_pinpoints` in `$GRANITE_MODEL/config.json`. + +```json +{ + "_name_or_path": "siglip-model", + "architectures": [ + "SiglipVisionModel" + ], + "image_grid_pinpoints": [ + [384,384], + [384,768], + [384,1152], + [384,1536], + [384,1920], + [384,2304], + [384,2688], + [384,3072], + [384,3456], + [384,3840], + [768,384], + [768,768], + [768,1152], + [768,1536], + [768,1920], + [1152,384], + [1152,768], + [1152,1152], + [1536,384], + [1536,768], + [1920,384], + [1920,768], + [2304,384], + [2688,384], + [3072,384], + [3456,384], + [3840,384] + ], + "mm_patch_merge_type": "spatial_unpad", + "hidden_size": 1152, + "image_size": 384, + "intermediate_size": 4304, + "model_type": "siglip_vision_model", + "num_attention_heads": 16, + "num_hidden_layers": 27, + "patch_size": 14, + "layer_norm_eps": 1e-6, + "hidden_act": "gelu_pytorch_tanh", + "projection_dim": 0, + "vision_feature_layer": [-24, -20, -12, -1] +} +``` + +At this point you should have something like this: +```bash +$ ls $ENCODER_PATH +config.json llava.projector pytorch_model.bin +``` + +Now convert the components to GGUF; Note that we also override the image mean/std dev to `[.5,.5,.5]` since we use the SigLIP visual encoder - in the transformers model, you can find these numbers in the `preprocessor_config.json`. +```bash +$ python convert_image_encoder_to_gguf.py \ + -m $ENCODER_PATH \ + --llava-projector $ENCODER_PATH/llava.projector \ + --output-dir $ENCODER_PATH \ + --clip-model-is-vision \ + --clip-model-is-siglip \ + --image-mean 0.5 0.5 0.5 \ + --image-std 0.5 0.5 0.5 +``` + +This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as the `$VISUAL_GGUF_PATH.` + + +### 3. Creating the LLM GGUF. +The granite vision model contains a granite LLM as its language model. For now, the easiest way to get the GGUF for LLM is by loading the composite model in `transformers` and exporting the LLM so that it can be directly converted with the normal conversion path. + +First, set the `LLM_EXPORT_PATH` to the path to export the `transformers` LLM to. +```bash +$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm +``` + +```python +import os +import transformers + +MODEL_PATH = os.getenv("GRANITE_MODEL") +if not MODEL_PATH: + raise ValueError("env var GRANITE_MODEL is unset!") + +LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH") +if not LLM_EXPORT_PATH: + raise ValueError("env var LLM_EXPORT_PATH is unset!") + +tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH) + +# NOTE: granite vision support was added to transformers very recently (4.49); +# if you get size mismatches, your version is too old. +# If you are running with an older version, set `ignore_mismatched_sizes=True` +# as shown below; it won't be loaded correctly, but the LLM part of the model that +# we are exporting will be loaded correctly. +model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True) + +tokenizer.save_pretrained(LLM_EXPORT_PATH) +model.language_model.save_pretrained(LLM_EXPORT_PATH) +``` + +Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama cpp project. +```bash +$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf +... +$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH +``` + + +### 4. Quantization +If you want to quantize the LLM, you can do so with `llama-quantize` as you would any other LLM. For example: +```bash +$ ./build/bin/llama-quantize $LLM_EXPORT_PATH/granite_llm.gguf $LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf Q4_K_M +$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf +``` + +Note that currently you cannot quantize the visual encoder because granite vision models use SigLIP as the visual encoder, which has tensor dimensions that are not divisible by 32. + + +### 5. Running the Model in Llama cpp +Build llama cpp normally; you should have a target binary named `llama-mtmd-cli`, which you can pass two binaries to. As an example, we pass the the llama.cpp banner. + +```bash +$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \ + --mmproj $VISUAL_GGUF_PATH \ + -c 16384 \ + --temp 0 +``` |
