llmnpc - llama.cpp/docs/multimodal.md

Path: llmnpc / llama.cpp / docs / multimodal.md (raw)
  1# Multimodal
  2
  3llama.cpp supports multimodal input via `libmtmd`. Currently, there are 2 tools support this feature:
  4- [llama-mtmd-cli](../tools/mtmd/README.md)
  5- [llama-server](../tools/server/README.md) via OpenAI-compatible `/chat/completions` API
  6
  7Currently, we support **image** and **audio** input. Audio is highly experimental and may have reduced quality.
  8
  9To enable it, you can use one of the 2 methods below:
 10
 11- Use `-hf` option with a supported model (see a list of pre-quantized model below)
 12    - To load a model using `-hf` while disabling multimodal, use `--no-mmproj`
 13    - To load a model using `-hf` while using a custom mmproj file, use `--mmproj local_file.gguf`
 14- Use `-m model.gguf` option with `--mmproj file.gguf` to specify text and multimodal projector respectively
 15
 16By default, multimodal projector will be offloaded to GPU. To disable this, add `--no-mmproj-offload`
 17
 18For example:
 19
 20```sh
 21# simple usage with CLI
 22llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
 23
 24# simple usage with server
 25llama-server -hf ggml-org/gemma-3-4b-it-GGUF
 26
 27# using local file
 28llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf
 29
 30# no GPU offload
 31llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
 32```
 33
 34## Pre-quantized models
 35
 36These are ready-to-use models, most of them come with `Q4_K_M` quantization by default. They can be found at the Hugging Face page of the ggml-org: https://huggingface.co/collections/ggml-org/multimodal-ggufs-68244e01ff1f39e5bebeeedc
 37
 38Replaces the `(tool_name)` with the name of binary you want to use. For example, `llama-mtmd-cli` or `llama-server`
 39
 40NOTE: some models may require large context window, for example: `-c 8192`
 41
 42**Vision models**:
 43
 44```sh
 45# Gemma 3
 46(tool_name) -hf ggml-org/gemma-3-4b-it-GGUF
 47(tool_name) -hf ggml-org/gemma-3-12b-it-GGUF
 48(tool_name) -hf ggml-org/gemma-3-27b-it-GGUF
 49
 50# SmolVLM
 51(tool_name) -hf ggml-org/SmolVLM-Instruct-GGUF
 52(tool_name) -hf ggml-org/SmolVLM-256M-Instruct-GGUF
 53(tool_name) -hf ggml-org/SmolVLM-500M-Instruct-GGUF
 54(tool_name) -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
 55(tool_name) -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
 56(tool_name) -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
 57
 58# Pixtral 12B
 59(tool_name) -hf ggml-org/pixtral-12b-GGUF
 60
 61# Qwen 2 VL
 62(tool_name) -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
 63(tool_name) -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF
 64
 65# Qwen 2.5 VL
 66(tool_name) -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
 67(tool_name) -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
 68(tool_name) -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
 69(tool_name) -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF
 70
 71# Mistral Small 3.1 24B (IQ2_M quantization)
 72(tool_name) -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
 73
 74# InternVL 2.5 and 3
 75(tool_name) -hf ggml-org/InternVL2_5-1B-GGUF
 76(tool_name) -hf ggml-org/InternVL2_5-4B-GGUF
 77(tool_name) -hf ggml-org/InternVL3-1B-Instruct-GGUF
 78(tool_name) -hf ggml-org/InternVL3-2B-Instruct-GGUF
 79(tool_name) -hf ggml-org/InternVL3-8B-Instruct-GGUF
 80(tool_name) -hf ggml-org/InternVL3-14B-Instruct-GGUF
 81
 82# Llama 4 Scout
 83(tool_name) -hf ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF
 84
 85# Moondream2 20250414 version
 86(tool_name) -hf ggml-org/moondream2-20250414-GGUF
 87
 88```
 89
 90**Audio models**:
 91
 92```sh
 93# Ultravox 0.5
 94(tool_name) -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF
 95(tool_name) -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF
 96
 97# Qwen2-Audio and SeaLLM-Audio
 98# note: no pre-quantized GGUF this model, as they have very poor result
 99# ref: https://github.com/ggml-org/llama.cpp/pull/13760
100
101# Mistral's Voxtral
102(tool_name) -hf ggml-org/Voxtral-Mini-3B-2507-GGUF
103```
104
105**Mixed modalities**:
106
107```sh
108# Qwen2.5 Omni
109# Capabilities: audio input, vision input
110(tool_name) -hf ggml-org/Qwen2.5-Omni-3B-GGUF
111(tool_name) -hf ggml-org/Qwen2.5-Omni-7B-GGUF
112```
113
114## Finding more models:
115
116GGUF models on Huggingface with vision capabilities can be found here: https://huggingface.co/models?pipeline_tag=image-text-to-text&sort=trending&search=gguf