| TODO | llama.cpp/.github/workflows/build.yml:1041 | disabled for now, consider adding tests for all CPU variants instead |
| TODO | llama.cpp/.github/workflows/build.yml:1079 | Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project |
| TODO | llama.cpp/.github/workflows/build.yml:1124 | Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project |
| TODO | llama.cpp/.github/workflows/build.yml:1168 | add SSL support; we will also need to modify win-build-sycl.bat to accept user-specified args |
| FIXME | llama.cpp/.github/workflows/build.yml:1392 | test on devices |
| TODO | llama.cpp/.github/workflows/build.yml:1461 | simplify the following workflows using a matrix |
| TODO | llama.cpp/.github/workflows/build.yml:1462 | run lighter CI on PRs and the full CI only on master (if needed) |
| TODO | llama.cpp/.github/workflows/release.yml:400 | Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project |
| TODO | llama.cpp/CMakeLists.txt:42 | analyze performance impact, see https://spidermonkey.dev/blog/2025/01/15/is-memory64-actually-worth-using |
| TODO | llama.cpp/CONTRIBUTING.md:149 | abbreviations usage |
| TODO | llama.cpp/CONTRIBUTING.md:153 | add guidelines with examples and apply them to the codebase |
| TODO | llama.cpp/ci/run.sh:55 | Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project |
| TODO | llama.cpp/ci/run.sh:334 | this hangs for some reason ... |
| TODO | llama.cpp/common/CMakeLists.txt:113 | use list(APPEND LLAMA_COMMON_EXTRA_LIBS ...) |
| TODO | llama.cpp/common/arg.cpp:178 | detect this based on the current console |
| TODO | llama.cpp/common/arg.cpp:667 | maybe convert enum llama_example to string |
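
The arg.cpp:667 entry asks for an enum-to-string conversion. A minimal sketch of what that could look like with a switch-based mapping; the enumerator names below are illustrative stand-ins, not the actual `llama_example` list from common.h:

```cpp
#include <string>

// Illustrative stand-in for enum llama_example (the real enumerators live in common.h).
enum llama_example_sketch {
    EXAMPLE_COMMON,
    EXAMPLE_MAIN,
    EXAMPLE_SERVER,
};

// One switch keeps the mapping exhaustive and easy to audit.
static std::string example_to_string(llama_example_sketch ex) {
    switch (ex) {
        case EXAMPLE_COMMON: return "common";
        case EXAMPLE_MAIN:   return "main";
        case EXAMPLE_SERVER: return "server";
    }
    return "unknown";
}
```
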
| TODO | llama.cpp/common/arg.cpp:867 | support arg with 2 values |
| TODO | llama.cpp/common/chat-parser-xml-toolcall.cpp:506 | Delete this when json_partial adds top-level support for null/true/false |
| TODO | llama.cpp/common/chat-parser-xml-toolcall.cpp:652 | Note that form.allow_toolcall_in_think is not tested yet. If anyone confirms it works, this comment can be removed. |
| TODO | llama.cpp/common/chat-parser.cpp:1401 | Tool calling |
| TODO | llama.cpp/common/chat-parser.h:22 | rename to params |
| TODO | llama.cpp/common/chat.cpp:133 | these can become expensive for long messages - how to optimize? |
| TODO | llama.cpp/common/chat.cpp:200 | this is ugly, refactor it somehow |
| TODO | llama.cpp/common/chat.cpp:812 | do we need to merge, or is replacing fine? |
| TODO | llama.cpp/common/chat.cpp:818 | merge properly instead of overwriting (matching old behavior) |
| TODO | llama.cpp/common/chat.cpp:836 | improve this later |
| TODO | llama.cpp/common/chat.cpp:2401 | if (has_raw_python) |
| TODO | llama.cpp/common/chat.cpp:3228 | support that mix in the handlers below. |
| TODO | llama.cpp/common/chat.h:155 | refactor this to "bool enable_thinking" |
| TODO | llama.cpp/common/chat.h:179 | refactor this to "bool parse_reasoning" |
| TODO | llama.cpp/common/common.cpp:101 | windows + arm64 + mingw64 |
| TODO | llama.cpp/common/common.cpp:381 | windows + arm64 + mingw64 |
| TODO | llama.cpp/common/common.cpp:997 | move to common/sampling |
| TODO | llama.cpp/common/common.cpp:1117 | fix naming |
| TODO | llama.cpp/common/common.h:521 | support threadpool |
| TODO | llama.cpp/common/common.h:829 | replace embd_norm with an enum |
| TODO | llama.cpp/common/console.cpp:1013 | maybe support multiline history entries? |
| TODO | llama.cpp/common/download.cpp:360 | maybe retry only on certain codes |
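
The download.cpp:360 entry suggests retrying only on certain HTTP codes. A minimal sketch of such a policy with exponential backoff; `perform_request` is a hypothetical stand-in for the libcurl call at that site, not the actual helper:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Retry only on rate limiting and transient server errors.
static bool is_retryable(long http_status) {
    return http_status == 408 || http_status == 429 ||
           (http_status >= 500 && http_status < 600);
}

// perform_request performs one attempt and returns its HTTP status.
static long download_with_retries(const std::function<long()> & perform_request, int max_attempts) {
    long status = 0;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        status = perform_request();
        if (!is_retryable(status)) {
            break; // success, or a permanent error such as 404 - do not retry
        }
        // exponential backoff between attempts
        std::this_thread::sleep_for(std::chrono::milliseconds(250 << attempt));
    }
    return status;
}
```
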
| TODO | llama.cpp/common/download.cpp:436 | use the actual GET status? |
| TODO | llama.cpp/common/download.cpp:747 | cache the manifest response so that it appears in the model list |
| TODO | llama.cpp/common/download.cpp:848 | get the GGUF size, not the manifest size |
| TODO | llama.cpp/common/jinja/lexer.cpp:213 | handle lstrip/rstrip for comments? (not important for now) |
| FIXME | llama.cpp/common/jinja/parser.cpp:424 | tests can also be expressed like this: if x is eq 3 |
| TODO | llama.cpp/common/jinja/runtime.h:585 | probably allow printing value_none as the string "None"? currently this breaks some templates |
| TODO | llama.cpp/common/jinja/value.cpp:289 | make sure this is the same behavior as Python's strftime |
| FIXME | llama.cpp/common/jinja/value.cpp:575 | Support an unspecified delimiter (split on consecutive whitespace, with no leading or trailing whitespace) |
| FIXME | llama.cpp/common/jinja/value.cpp:599 | Support an unspecified delimiter (split on consecutive whitespace, with no leading or trailing whitespace) |
| FIXME | llama.cpp/common/jinja/value.cpp:916 | sorting is currently always case-sensitive |
| FIXME | llama.cpp/common/jinja/value.cpp:1027 | sorting is currently always case-sensitive |
| TODO | llama.cpp/common/jinja/value.cpp:1166 | not sure if this is the right behavior |
| TODO | llama.cpp/common/jinja/value.cpp:1220 | avoid circular references |
| TODO | llama.cpp/common/jinja/value.cpp:1307 | avoid circular references |
| TODO | llama.cpp/common/jinja/value.h:156 | C++20 <=> operator |
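
The value.h:156 entry points at the C++20 three-way comparison operator. A minimal sketch of the idiom on an illustrative type, not the actual jinja value class:

```cpp
#include <compare>
#include <cstdint>
#include <string>

struct value_sketch {
    int64_t     i;
    std::string s;

    // One defaulted operator<=> replaces hand-written <, <=, > and >=
    // (and implicitly provides a defaulted operator== as well).
    auto operator<=>(const value_sketch &) const = default;
};

// usage: value_sketch{1, "a"} < value_sketch{1, "b"} evaluates to true
```
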
| TODO | llama.cpp/common/json-partial.cpp:311 | handle more unclosed top-level primitives if the stack was empty but we got an error (e.g. "tru", "\"", etc.) |
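
The json-partial.cpp:311 entry wants unclosed top-level primitives such as "tru" treated as healable input rather than a hard error. A minimal sketch of the prefix test, assuming complete literals already parse normally; an unterminated string ("\"") would need a similar check:

```cpp
#include <string_view>

// True when s is a strict, non-empty prefix of one of the JSON literals,
// e.g. "tru" for "true" - a candidate for healing rather than a parse error.
static bool is_partial_json_literal(std::string_view s) {
    for (std::string_view lit : {std::string_view("true"),
                                 std::string_view("false"),
                                 std::string_view("null")}) {
        if (!s.empty() && s.size() < lit.size() && lit.substr(0, s.size()) == s) {
            return true;
        }
    }
    return false;
}
```
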
| TODO | llama.cpp/common/json-partial.h:3 | use json_fwd.hpp when possible |
| TODO | llama.cpp/common/json-schema-to-grammar.cpp:971 | support minimum, maximum, exclusiveMinimum, exclusiveMaximum at least for zero |
| TODO | llama.cpp/common/peg-parser.cpp:1357 | Implement more comprehensive grammar generation for raw strings. |
| TODO | llama.cpp/common/preset.cpp:85 | maybe throw an error instead? |
| TODO | llama.cpp/common/preset.h:31 | maybe implement to_env() if needed |
| TODO | llama.cpp/common/sampling.cpp:12 | deduplicate with llama-impl.h |
| TODO | llama.cpp/common/sampling.cpp:397 | measure grammar performance |
| TODO | llama.cpp/common/sampling.cpp:471 | simplify |
| TODO | llama.cpp/common/sampling.cpp:617 | compute this from the vocab |
| TODO | llama.cpp/common/sampling.h:32 | measure grammar performance |
| TODO | llama.cpp/common/speculative.cpp:125 | track performance of the most recent calls |
| TODO | llama.cpp/common/speculative.cpp:171 | optimize or pass from outside? |
| TODO | llama.cpp/common/speculative.cpp:452 | implement |
| TODO | llama.cpp/common/speculative.cpp:735 | noop |
| TODO | llama.cpp/convert_hf_to_gguf.py:555 | why do we squeeze here? |
| TODO | llama.cpp/convert_hf_to_gguf.py:614 | use Q4_K and Q6_K |
| TODO | llama.cpp/convert_hf_to_gguf.py:854 | Handle "sliding_attention" similarly when models start implementing it |
| TODO | llama.cpp/convert_hf_to_gguf.py:966 | should these be marked as UNUSED instead? (maybe not) |
| TODO | llama.cpp/convert_hf_to_gguf.py:2357 | how to determine special FIM tokens automatically? |
| TODO | llama.cpp/convert_hf_to_gguf.py:2971 | remove this once everyone has migrated to a newer version of llama.cpp |
| TODO | llama.cpp/convert_hf_to_gguf.py:3184 | multiply by the scale directly instead of inverting it twice |
| TODO | llama.cpp/convert_hf_to_gguf.py:5481 | this is a hack and should be fixed |
| TODO | llama.cpp/convert_hf_to_gguf.py:6071 | these special tokens should be exported only for the CodeGemma family |
| TODO | llama.cpp/convert_hf_to_gguf.py:6575 | implement self.prediction_coefs.weight.clamp_(...) |
| TODO | llama.cpp/convert_hf_to_gguf.py:7073 | does this really matter? |
| TODO | llama.cpp/convert_hf_to_gguf.py:7997 | MiMo v2 does not indicate the number of next-token-prediction layers, so we cannot do it the same way as GLM4_MOE |
| TODO | llama.cpp/convert_hf_to_gguf.py:9315 | Extend this if the prefix(es) need to be configurable |
| TODO | llama.cpp/convert_hf_to_gguf.py:9892 | remove this once image support is implemented for Chameleon |
| TODO | llama.cpp/convert_hf_to_gguf.py:10471 | remove once MXFP4 is supported more generally |
| TODO | llama.cpp/convert_hf_to_gguf.py:10941 | remove this once everyone migrates to a newer version of llama.cpp |
| TODO | llama.cpp/convert_hf_to_gguf.py:11526 | uncomment U64, U32, and U16, ref: https://github.com/pytorch/pytorch/issues/58734 |
| TODO | llama.cpp/convert_hf_to_gguf_update.py:55 | generate tokenizer tests for llama.cpp |
| TODO | llama.cpp/convert_hf_to_gguf_update.py:81 | this string has to exercise as much pre-tokenizer functionality as possible |
| TODO | llama.cpp/convert_hf_to_gguf_update.py:85 | add models here, base models preferred |
| TODO | llama.cpp/convert_lora_to_gguf.py:64 | add ellipsis in the type signature |
| TODO | llama.cpp/convert_lora_to_gguf.py:99 | make sure this is correct |
| TODO | llama.cpp/convert_lora_to_gguf.py:167 | support higher-dimensional A shapes bigger than 1 |
| TODO | llama.cpp/convert_lora_to_gguf.py:173 | compose the above two |
| TODO | llama.cpp/examples/convert_legacy_llama.py:133 | match this with `llama_ftype` |
| TODO | llama.cpp/examples/convert_legacy_llama.py:134 | rename to LLAMAFileType |
| TODO | llama.cpp/examples/convert_legacy_llama.py:135 | move to `gguf.py` |
| TODO | llama.cpp/examples/convert_legacy_llama.py:209 | verify this |
| TODO | llama.cpp/examples/convert_legacy_llama.py:351 | reuse (probably move to gguf.py?) |
| FIXME | llama.cpp/examples/convert_legacy_llama.py:1266 | Respect --vocab-dir? |
| TODO | llama.cpp/examples/gguf-hash/deps/xxhash/xxhash.h:798 | Update to the correct value when it's been specified. |
| TODO | llama.cpp/examples/gguf-hash/deps/xxhash/xxhash.h:3920 | IBM XL |
| FIXME | llama.cpp/examples/gguf-hash/deps/xxhash/xxhash.h:4670 | Clang's output is still _much_ faster -- on an AMD Ryzen 3600, |
| TODO | llama.cpp/examples/json_schema_to_grammar.py:218 | support "uri", "email" string formats |
| TODO | llama.cpp/examples/json_schema_to_grammar.py:694 | support minimum, maximum, exclusiveMinimum, exclusiveMaximum at least for zero |
| TODO | llama.cpp/examples/parallel/parallel.cpp:507 | print sampling/grammar timings for all clients |
| TODO | llama.cpp/examples/pydantic_models_to_grammar.py:20 | fix this |
| TODO | llama.cpp/examples/retrieval/retrieval.cpp:8 | remove me |
| TODO | llama.cpp/examples/speculative-simple/speculative-simple.cpp:51 | simplify this logic |
| TODO | llama.cpp/examples/speculative/speculative.cpp:423 | simplify |
| TODO | llama.cpp/examples/speculative/speculative.cpp:629 | print sampling/grammar timings for all drafts |
| TODO | llama.cpp/ggml/CMakeLists.txt:90 | mark all options as advanced when not GGML_STANDALONE |
| TODO | llama.cpp/ggml/include/ggml-metal.h:42 | remove in the future |
| TODO | llama.cpp/ggml/include/ggml.h:190 | support for clang |
| TODO | llama.cpp/ggml/include/ggml.h:249 | convert to enum, see https://github.com/ggml-org/llama.cpp/pull/16187#discussion_r2388538726 |
| TODO | llama.cpp/ggml/include/ggml.h:749 | temporary until model loading of ggml examples is refactored |
| TODO | llama.cpp/ggml/include/ggml.h:1550 | when we start computing the gradient, make a copy instead of a view |
| TODO | llama.cpp/ggml/include/ggml.h:1557 | when we start computing the gradient, make a copy instead of a view |
| TODO | llama.cpp/ggml/include/ggml.h:1570 | when we start computing the gradient, make a copy instead of a view |
| TODO | llama.cpp/ggml/include/ggml.h:1955 | this is very likely wrong for some cases - needs more testing |
| TODO | llama.cpp/ggml/include/ggml.h:2346 | needs to be adapted to ggml_flash_attn_ext |
| TODO | llama.cpp/ggml/include/ggml.h:2459 | currently only the lower, right, non-unitriangular variant is implemented |
| TODO | llama.cpp/ggml/include/ggml.h:2723 | currently, only a few functions are in the base ggml API, while the rest are in the CPU backend |
| TODO | llama.cpp/ggml/src/CMakeLists.txt:78 | should not be set globally |
| TODO | llama.cpp/ggml/src/CMakeLists.txt:103 | these flags probably need to be tweaked on some architectures |
| TODO | llama.cpp/ggml/src/ggml-alloc.c:738 | better way to add external dependencies |
| FIXME | llama.cpp/ggml/src/ggml-backend-reg.cpp:163 | backends cannot be safely unloaded without a function to destroy all the backend resources, |
| FIXME | llama.cpp/ggml/src/ggml-backend.cpp:182 | add a generic callback to the buffer interface |
| FIXME | llama.cpp/ggml/src/ggml-backend.cpp:1199 | count the number of inputs instead of only checking when full |
| TODO | llama.cpp/ggml/src/ggml-backend.cpp:1567 | add a public function to facilitate this, since applications do not have direct access to the backend interface |
| TODO | llama.cpp/ggml/src/ggml-backend.cpp:1609 | pass the backend to the callback; then the user can decide if they want to synchronize |
| FIXME | llama.cpp/ggml/src/ggml-backend.cpp:1658 | needs to be size*2 to account for leafs (do it in graph_split instead) |
| TODO | llama.cpp/ggml/src/ggml-blas/ggml-blas.cpp:411 | find the optimal value |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:1073 | performance is low. |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2264 | check theta_scale_length and position_length. |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2341 | acl_yarn_ramp_tensor should use the rope cache. |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2812 | n_dims < ne0 |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2839 | ne0 != n_dims in mode2 |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.h:883 | If `ne12 > 1`, grouped multiplication and memory copying is used for efficiency. |
| TODO | llama.cpp/ggml/src/ggml-cann/common.h:619 | each stream should have a memory pool. |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:173 | add more device info later. |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1104 | the CANN backend doesn't support quantized types yet; just leave the code |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1208 | need to handle tensors that have padding. |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1229 | refer to CANN (#6017); it uses the thread's default stream. |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1311 | Support 310p P2P copy |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1438 | quantized type? |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2016 | Support 310p P2P copy |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2040 | this event is not effective with ACL graph mode; change to use aclrtSynchronizeStream |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2096 | support broadcast for ADD + RMS_NORM |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2205 | Optimize here. Currently, we can only |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2354 | support GGML_TYPE_BF16 |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2369 | Support rope_dim < ne00(dim) |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2441 | add circular padding support for cann, see https://github.com/ggml-org/llama.cpp/pull/16985 |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2474 | support bias != 0.0f |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2476 | support attention sinks [TAG_ATTN_SINKS] |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2498 | support attention sinks [TAG_ATTN_SINKS] |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2507 | padding to support |
| TODO | llama.cpp/ggml/src/ggml-common.h:1087 | fix the name to kvalues_iq4_nl |
| TODO | llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt:503 | Separation to determine activation of VX/VXE/VXE2 |
| TODO | llama.cpp/ggml/src/ggml-cpu/amx/amx.cpp:152 | not sure if correct (https://github.com/ggml-org/llama.cpp/pull/16315) |
| TODO | llama.cpp/ggml/src/ggml-cpu/amx/common.h:83 | fix padding for the VNNI format |
| TODO | llama.cpp/ggml/src/ggml-cpu/amx/mmq.cpp:510 | this is the reference impl! |
| TODO | llama.cpp/ggml/src/ggml-cpu/amx/mmq.cpp:2426 | performance improvement: merge quant A |
| TODO | llama.cpp/ggml/src/ggml-cpu/arch/wasm/quants.c:382 | check if unrolling this is better |
| TODO | llama.cpp/ggml/src/ggml-cpu/arch/wasm/quants.c:475 | check if unrolling this is better |
| FIXME | llama.cpp/ggml/src/ggml-cpu/arch/x86/cpu-feats.cpp:264 | this does not check for OS support |
| TODO | llama.cpp/ggml/src/ggml-cpu/arch/x86/quants.c:1110 | can _mm256_mulhi_epu16 be faster even if 16-bit? |
| TODO | llama.cpp/ggml/src/ggml-cpu/binary-ops.cpp:114 | Use the 'traits' lookup table (for type conversion fns) instead of a mass of 'if' conditions with long templates |
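
The binary-ops.cpp:114 entry (and the matching unary-ops.cpp:135 one further down) asks to replace long if-chains with a type-indexed lookup table. A minimal sketch of the idea using illustrative names, not the real ggml traits structures:

```cpp
#include <cstdint>

typedef float (*to_f32_fn)(const void * src, int64_t i);

static float f32_to_f32(const void * src, int64_t i) {
    return ((const float *) src)[i];
}
static float i32_to_f32(const void * src, int64_t i) {
    return (float) ((const int32_t *) src)[i];
}

enum sketch_type { SKETCH_F32, SKETCH_I32, SKETCH_TYPE_COUNT };

// One table lookup replaces a mass of 'if (type == ...)' branches.
static const to_f32_fn sketch_traits[SKETCH_TYPE_COUNT] = {
    /* SKETCH_F32 */ f32_to_f32,
    /* SKETCH_I32 */ i32_to_f32,
};

static float load_as_f32(sketch_type t, const void * src, int64_t i) {
    return sketch_traits[t](src, i);
}
```
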
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:170 | double-check these work correctly |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:521 | move to ggml-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:123 | add support for explicit memory order |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:130 | add support for explicit memory order |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:137 | add support for explicit memory order |
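
The three ggml-cpu.c entries above ask for explicit memory orders on the atomic helpers. A minimal sketch of the distinction, shown with C++ std::atomic for brevity (ggml-cpu.c itself is C and wraps stdatomic/Interlocked primitives):

```cpp
#include <atomic>

static std::atomic<int> g_counter{0};

// A plain statistics counter needs no ordering guarantees.
static void counter_inc_relaxed() {
    g_counter.fetch_add(1, std::memory_order_relaxed);
}

// Acquire pairs with a release store when the atomic publishes other data.
static int counter_load_acquire() {
    return g_counter.load(std::memory_order_acquire);
}
```
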
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1206 | this is a bit of a hack; we should probably have a better way to handle this |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1263 | extract to "extra_op" |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1477 | this is a bit of a hack; we should probably have a better way to handle this |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2159 | Windows etc. |
| FIXME | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2280 | get_rows can use additional threads, but the cost of launching additional threads |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2431 | support > 64 CPUs |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2524 | there seems to be no way to set a lower priority on Apple platforms |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2547 | this may not work on BSD, to be verified |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2893 | this can become (n_tasks-1) |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2896 | this can become (n_tasks-1) |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2899 | this can become (n_tasks-1) |
| FIXME | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.cpp:136 | deep copy |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.cpp:665 | move to ggml-base |
| FIXME | llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp:303 | this should check for __ARM_FEATURE_FP16_VECTOR_ARITHMETIC |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1708 | support for transposed / permuted tensors |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1712 | maybe this is not optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1752 | support for transposed / permuted tensors |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1756 | maybe this is not optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1797 | templateify the implementation and add support for I64 |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1832 | support for transposed / permuted tensors |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1850 | maybe this is not optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1913 | smarter multi-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1956 | smarter multi-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1999 | smarter multi-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:2042 | smarter multi-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:3729 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:3798 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:3970 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:4070 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:4410 | add an x parameter to ggml_vec_scale_f32 and remove this memcpy |
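
The ops.cpp:4410 entry wants a source parameter added to the scale routine so the staging memcpy can go away. A minimal sketch of the out-of-place variant; the signature is a guess modeled on ggml_vec_scale_f32, not the actual API:

```cpp
// Out-of-place scale: read from x, write to y, no staging copy required.
static void vec_scale_f32_src(const int n, float * y, const float * x, const float v) {
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * v;
    }
}
```
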
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:5086 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:5165 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:5253 | is this supposed to be ceil instead of floor? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:5378 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:7713 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:8556 | on ARM, native f16 should be faster |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9214 | transpose the output for smaller strides for big batches? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9331 | maybe unroll more? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9419 | what happens when (d_state % svcntw()) != 0? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9495 | optimize / multi-thread |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9562 | optimize / multi-thread |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:10403 | Write SVE code and RVV code |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:10658 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:10756 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/quants.c:151 | add WASM SIMD |
| TODO | llama.cpp/ggml/src/ggml-cpu/repack.cpp:2379 | this branch seems wrong |
| TODO | llama.cpp/ggml/src/ggml-cpu/repack.cpp:2500 | generalise. |
| TODO | llama.cpp/ggml/src/ggml-cpu/repack.cpp:2541 | needs to be revisited |
| TODO | llama.cpp/ggml/src/ggml-cpu/repack.cpp:2805 | General batched mul mat for 4D tensors |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:468 | is this optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:568 | is this optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:862 | Does this work? |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:886 | is this optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:978 | is this optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/unary-ops.cpp:135 | Use the 'traits' lookup table (for type conversion fns) instead of a mass of 'if' conditions with long templates |
| TODO | llama.cpp/ggml/src/ggml-cpu/vec.cpp:475 | optimize to process the remaining elements in groups using the smaller vector sizes from AVX2 and SSE |
| TODO | llama.cpp/ggml/src/ggml-cpu/vec.h:609 | Write SVE code |
| TODO | llama.cpp/ggml/src/ggml-cpu/vec.h:672 | Write SVE code |
| TODO | llama.cpp/ggml/src/ggml-cpu/vec.h:950 | optimize performance |
| TODO | llama.cpp/ggml/src/ggml-cuda/CMakeLists.txt:60 | Remove once CCCL 3.2 has been released and bundled with the CUDA Toolkit |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:190 | might need to bail out if the HTP is stuck on something |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:205 | handle errors |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:208 | update the profiling implementation; it currently only works for opt_opsync mode |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:1825 | support broadcast for ne[2] and ne[3] |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:1981 | add support for non-contiguous tensors |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:2000 | add support for non-contiguous tensors |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:2047 | add support for sinks |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:2166 | add support for GGML_TYPE_F16 for src0 |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:2727 | the current version might do incorrect reordering in cases where quantized src0 |
| TODO | llama.cpp/ggml/src/ggml-hexagon/htp/hex-dma.h:32 | technically we don't need these and could use Q6_dmstart/wait/etc instead |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:930 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:962 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:1044 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:1085 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:1186 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:1241 | might need to handle zero as a special case (see ggml-cpu code) |
| TODO | llama.cpp/ggml/src/ggml-hexagon/htp/rope-ops.c:334 | use SIMD to speed up copying the remaining elements |
| TODO | llama.cpp/ggml/src/ggml-hip/CMakeLists.txt:86 | do not use CUDA definitions for HIP |
| TODO | llama.cpp/ggml/src/ggml-impl.h:72 | move to ggml.h? (won't be able to inline) |
| TODO | llama.cpp/ggml/src/ggml-impl.h:603 | Consider allowing GGML_OP_NONE nodes in between |
| FIXME | llama.cpp/ggml/src/ggml-metal/CMakeLists.txt:103 | only add to the ggml-metal target? |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-impl.h:9 | for optimal performance, become a function of the device and work size |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:56 | this can be removed when the allocator starts filtering them earlier |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:632 | make a simpler cpy_bytes kernel |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:1629 | relax this constraint in the future |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:1816 | helper function |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:1836 | determine the optimal parameters based on grid utilization |
| TODO | llama.cpp/ggml/src/ggml-musa/CMakeLists.txt:73 | do not use CUDA definitions for MUSA |
| TODO | llama.cpp/ggml/src/ggml-musa/CMakeLists.txt:107 | mudnn has not provided static libraries yet |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:2918 | initialize them for the non-SMALL_PATH path, or remove them. |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3268 | add support |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3270 | implement BF16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, IQ4_NL support (https://github.com/ggml-org/llama.cpp/pull/14661) |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3374 | add circular padding support for opencl, see https://github.com/ggml-org/llama.cpp/pull/16985 |
| FIXME | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3771 | if any unexpected results are seen, double check the offset - |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3916 | use preallocated images instead of sub-buffer then image |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:5146 | find the optimal values for these |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:8515 | remove duplicate definitions of image description + format -- move to top |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9052 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9091 | add block_q4_0 variant. |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9110 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9147 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9209 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9245 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9282 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9319 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9358 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9396 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9428 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9466 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9499 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9655 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9699 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9735 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9879 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9918 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:10258 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:481 | currently the output_size is always known; do we need support for commands with variable output size? |
| TODO | llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:789 | cache the alloc responses to avoid extra RPC calls? |
| TODO | llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:1932 | obtain the value from the server |
| TODO | llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:1970 | call the remote backend and cache the results |
| TODO | llama.cpp/ggml/src/ggml-sycl/common.hpp:84 | adapt to hardware |
| TODO | llama.cpp/ggml/src/ggml-sycl/common.hpp:87 | currently, it's not really used for XMX. |
| TODO | llama.cpp/ggml/src/ggml-sycl/convert.cpp:517 | Downsample logic is separated from the kernel; a rewrite is desirable |
| TODO | llama.cpp/ggml/src/ggml-sycl/getrows.cpp:180 | Refactor and remove duplicates |
| TODO | llama.cpp/ggml/src/ggml-sycl/getrows.cpp:211 | k-quants |
| FIXME | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:863 | do not crash if SYCL buffer alloc fails |
| FIXME | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:1118 | this is not thread safe |
| FIXME | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:1187 | this is a hack to avoid having to implement a new buffer type |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:1202 | return device.maxBufferLength |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:2561 | check that src0->buffer->buft is a split buffer type, replace the GGML_BACKEND_TYPE_GPU_SPLIT check |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:2965 | see https://github.com/ggml-org/llama.cpp/pull/13155 |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3211 | accuracy issues in MMQ |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3536 | Refactor and clean up mul mat dispatching. |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3913 | more efficient implementation |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4459 | update for the new |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4641 | support GGML_TYPE_BF16 |
| FIXME | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4642 | keep a list of supported types to avoid breaking the backend when a new type is added |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4646 | The configuration below needs more work to be supported with oneDNN |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4652 | This specific configuration can fail with oneDNN and needs more debugging |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4846 | add circular padding support for sycl, see https://github.com/ggml-org/llama.cpp/pull/16985 |
| TODO | llama.cpp/ggml/src/ggml-sycl/softmax.cpp:67 | non-contiguous inputs/outputs |
| TODO | llama.cpp/ggml/src/ggml-sycl/sycl_hw.cpp:3 | currently not used |
| TODO | llama.cpp/ggml/src/ggml-sycl/sycl_hw.hpp:13 | currently not used |
| TODO | llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:3167 | We're no longer benefitting from the async compiles (shaders are |
| TODO | llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:5258 | Use a pointer or reference to avoid a copy |
| TODO | llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6520 | staging_offset is not used |
| TODO | llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:14371 | enable async and synchronize |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:486 | error handling |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:723 | handle multiple pipeline names |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2361 | optional, needed? |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2365 | optional, implement this |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2367 | optional, think it coordinates with .init_tensor |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2453 | for now, return maxBufferSize as both free and total memory |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2868 | track the need for these toggles: https://issues.chromium.org/issues/42251215 |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2965 | Maybe WebGPU needs a "fast" mode where you can request compilers skip adding checks like these, |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:3150 | support non-contiguous tensors, e.g. for MOE_EXPERT_REDUCE |
| TODO | llama.cpp/ggml/src/ggml-zdnn/ggml-zdnn.cpp:22 | implement support for quantized types |
| TODO | llama.cpp/ggml/src/ggml-zdnn/ggml-zdnn.cpp:609 | make thread-safe |
| TODO | llama.cpp/ggml/src/ggml-zdnn/mmf.cpp:70 | Remove in the future; we currently convert DLF16 -> FP32, and then the next op converts FP32 -> DLF16 again. Inefficient. |
| TODO | llama.cpp/ggml/src/ggml-zdnn/utils.cpp:71 | Consider adding a ggml check. |
| TODO | llama.cpp/ggml/src/ggml-zdnn/utils.cpp:72 | If the tensor is 4D, use ZDNN_NCHW by default. |
| TODO | llama.cpp/ggml/src/ggml-zdnn/utils.cpp:73 | If the tensor is 2D, use ZDNN_NHWC by default. |
| FIXME | llama.cpp/ggml/src/ggml.c:10 | required here for quantization functions |
| TODO | llama.cpp/ggml/src/ggml.c:1723 | this should not be needed as long as we don't rely on aligned SIMD loads |
| TODO | llama.cpp/ggml/src/ggml.c:1992 | support a less-strict constraint |
| TODO | llama.cpp/ggml/src/ggml.c:3788 | implement non-F32 return |
| TODO | llama.cpp/ggml/src/ggml.c:3812 | implement non-F32 return |
| TODO | llama.cpp/ggml/src/ggml.c:4320 | when implementing backward, fix this: |
| TODO | llama.cpp/ggml/src/ggml.c:4923 | implement antialias for modes other than bilinear |
| TODO | llama.cpp/ggml/src/ggml.c:5264 | check if vT can be multiplied by (k*qT) |
| TODO | llama.cpp/ggml/src/ggml.c:5341 | adapt to ggml_flash_attn_ext() changes |
| TODO | llama.cpp/ggml/src/ggml.c:5344 | check if vT can be multiplied by (k*qT) |
| TODO | llama.cpp/ggml/src/ggml.c:5417 | maybe support strides other than 1? |
| TODO | llama.cpp/ggml/src/ggml.c:6093 | support other variants |
| TODO | llama.cpp/ggml/src/ggml.c:6287 | should probably be sum instead of mean |
| TODO | llama.cpp/ggml/src/ggml.c:6799 | this branch isn't accessible anymore; maybe move this to ggml_build_forward_expand |
| FIXME | llama.cpp/ggml/src/ggml.c:7421 | use ggml-backend to obtain the tensor data |
| TODO | llama.cpp/gguf-py/gguf/constants.py:3666 | add GGMLFileType from ggml_ftype in ggml.h |
| TODO | llama.cpp/gguf-py/gguf/constants.py:3746 | need help with 64-bit types in Python |
| FIXME | llama.cpp/gguf-py/gguf/gguf_reader.py:73 | When/if _get_field_parts() supports multi-dimensional arrays, this must do so too |
| TODO | llama.cpp/gguf-py/gguf/gguf_reader.py:205 | add an option to generate an error on duplicate keys |
| FIXME | llama.cpp/gguf-py/gguf/gguf_reader.py:243 | Handle multi-dimensional arrays properly instead of flattening |
| TODO | llama.cpp/gguf-py/gguf/gguf_writer.py:425 | cleaner way to get the first key |
| TODO | llama.cpp/gguf-py/gguf/lazy.py:49 | make this even more comprehensive |
| TODO | llama.cpp/gguf-py/gguf/lazy.py:101 | dict and set |
| TODO | llama.cpp/gguf-py/gguf/lazy.py:122 | maybe handle tensors in kwargs too |
| TODO | llama.cpp/gguf-py/gguf/lazy.py:228 | __array_function__ |
| TODO | llama.cpp/gguf-py/gguf/metadata.py:72 | load adapter_config.json when possible; it usually contains the base model of the LoRA adapter |
| TODO | llama.cpp/gguf-py/gguf/metadata.py:325 | should word-based size labels always be removed instead? |
| TODO | llama.cpp/gguf-py/gguf/metadata.py:354 | should the basename version always be excluded? |
| TODO | llama.cpp/gguf-py/gguf/tensor_mapping.py:1210 | these do not belong to block_mappings_cfg - move them to mappings_cfg |
| TODO | llama.cpp/gguf-py/gguf/utility.py:87 | handle request errors (maybe with limited retries?) |
| TODO | llama.cpp/gguf-py/gguf/vocab.py:163 | internally store as the new format instead of converting to the old one |
| FIXME | llama.cpp/gguf-py/gguf/vocab.py:369 | Verify that added tokens here _cannot_ overlap with the main vocab. |
| TODO | llama.cpp/gguf-py/tests/test_metadata.py:110 | hf suffix which could be ignored but isn't |
| TODO | llama.cpp/gguf-py/tests/test_metadata.py:142 | DPO in the name |
| TODO | llama.cpp/gguf-py/tests/test_metadata.py:151 | should "base" be a 'finetune' or 'size_label'? |
| TODO | llama.cpp/gguf-py/tests/test_quants.py:107 | is a column-wise sum of squares appropriate? |
| TODO | llama.cpp/include/llama.h:57 | show sample usage |
| TODO | llama.cpp/include/llama.h:90 | remove; required until per-token attributes are available from the GGUF file |
| TODO | llama.cpp/include/llama.h:197 | simplify (https://github.com/ggml-org/llama.cpp/pull/9294#pullrequestreview-2286561979) |
| TODO | llama.cpp/include/llama.h:205 | consider SoA |
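
The llama.h:205 entry says "consider SoA". A minimal illustration of the array-of-structs vs struct-of-arrays layouts it refers to, with made-up field names; SoA keeps each attribute contiguous, which is friendlier to SIMD and to partial updates:

```cpp
#include <cstdint>
#include <vector>

// Array-of-structs: fields of one item are interleaved in memory.
struct item_aos {
    int32_t id;
    float   score;
};

// Struct-of-arrays: parallel arrays, index i describes one logical item.
struct items_soa {
    std::vector<int32_t> id;
    std::vector<float>   score;

    void push(int32_t i, float s) {
        id.push_back(i);
        score.push_back(s);
    }
};
```
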
| TODO | llama.cpp/include/llama.h:239 | rename this to "output" |
| TODO | llama.cpp/include/llama.h:417 | update the API to start accepting pointers to params structs (https://github.com/ggml-org/llama.cpp/discussions/9172) |
| TODO | llama.cpp/include/llama.h:532 | rename to llama_get_pooling_type |
| TODO | llama.cpp/include/llama.h:955 | rename to avoid confusion with llama_get_embeddings() |
| TODO | llama.cpp/include/llama.h:979 | deprecate in favor of llama_get_logits_ith() (ref: https://github.com/ggml-org/llama.cpp/pull/14853#issuecomment-3113143522) |
| TODO | llama.cpp/include/llama.h:994 | deprecate in favor of llama_get_embeddings_ith() (ref: https://github.com/ggml-org/llama.cpp/pull/14853#issuecomment-3113143522) |
| TODO | llama.cpp/include/llama.h:1469 | extend in the future |
| TODO | llama.cpp/scripts/check-requirements.sh:172 | the check is failing for some reason: |
| TODO | llama.cpp/src/llama-adapter.cpp:288 | add support for norm vector |
| TODO | llama.cpp/src/llama-adapter.cpp:297 | a more general solution for non-CPU extra buft should be implemented in the future |
| TODO | llama.cpp/src/llama-adapter.h:11 | pimpl |
| TODO | llama.cpp/src/llama-batch.h:32 | whole_seqs for embeddings? |
| TODO | llama.cpp/src/llama-batch.h:113 | support embeddings if needed in the future |
| TODO | llama.cpp/src/llama-batch.h:131 | this is more of a temporary solution until we have a better way to handle multiple positions per token/embd |
| TODO | llama.cpp/src/llama-context.cpp:105 | start reading the actual value of mscale and handle the case where it is not 1.0f |
| TODO | llama.cpp/src/llama-context.cpp:305 | move these checks to ggml_backend_sched |
| TODO | llama.cpp/src/llama-context.cpp:320 | should we ignore ACCEL types too? |
| TODO | llama.cpp/src/llama-context.cpp:436 | instead of the tensor names, use a map to keep track of which (FA) tensors belong to which layer |
| FIXME | llama.cpp/src/llama-context.cpp:444 | fa_device_mismatch logic is wrong for --no-kv-offload, but this is broken anyways |
| TODO | llama.cpp/src/llama-context.cpp:500 | not sure if the following graph would be the worst case for multi-stream KV caches: |
| FIXME | llama.cpp/src/llama-context.cpp:548 | if multiple single tokens are evaluated without a synchronization, |
| TODO | llama.cpp/src/llama-context.cpp:645 | change mctx->apply() to return information on whether a graph reserve is needed |
| TODO | llama.cpp/src/llama-context.cpp:722 | use output_resolve_row() |
| TODO | llama.cpp/src/llama-context.cpp:773 | use output_resolve_row() |
| TODO | llama.cpp/src/llama-context.cpp:987 | not sure yet if we want to reserve here |
| TODO | llama.cpp/src/llama-context.cpp:1112 | should we reserve? |
| TODO | llama.cpp/src/llama-context.cpp:1203 | add a new split mode where we pad the input sequences so that ubatch.equal_seqs == true |
| TODO | llama.cpp/src/llama-context.cpp:1213 | this clearing of the buffer can easily be forgotten - need something better |
| TODO | llama.cpp/src/llama-context.cpp:1235 | this is a tmp solution until we have a proper way to support enc-dec models |
| TODO | llama.cpp/src/llama-context.cpp:1317 | hacky solution |
| TODO | llama.cpp/src/llama-context.cpp:1493 | avoid this workaround in the future |
| TODO | llama.cpp/src/llama-context.cpp:1543 | this clearing of the buffer can easily be forgotten - need something better |
| TODO | llama.cpp/src/llama-context.cpp:1783 | is there something more efficient which also minimizes swaps? |
| TODO | llama.cpp/src/llama-context.cpp:1833 | hacky enc-dec support |
| TODO | llama.cpp/src/llama-context.cpp:1864 | also consider shrinking the buffer |
| TODO | llama.cpp/src/llama-context.cpp:1873 | not needed? |
| TODO | llama.cpp/src/llama-context.cpp:2039 | not sure if needed; might simplify in the future by removing this |
| FIXME | llama.cpp/src/llama-context.cpp:2144 | fix in ggml_backend_sched |
| TODO | llama.cpp/src/llama-context.cpp:2491 | add more model-specific info which should prevent loading the session file if not identical |
| TODO | llama.cpp/src/llama-context.cpp:2549 | handle sampling buffers and sampler state? |
| TODO | llama.cpp/src/llama-context.cpp:2574 | add more info which needs to be identical but which is not verified otherwise |
| TODO | llama.cpp/src/llama-context.cpp:2638 | handle sampling buffers and sampler state? |
| TODO | llama.cpp/src/llama-context.cpp:2831 | handle this error |
| TODO | llama.cpp/src/llama-context.cpp:2948 | better default |
| TODO | llama.cpp/src/llama-context.h:188 | more flexible combinations of logical/physical batch size and context size |
| TODO | llama.cpp/src/llama-context.h:251 | read/write lora adapters and cvec |
| TODO | llama.cpp/src/llama-context.h:268 | tmp for handling cross-attention - need something better, probably |
| TODO | llama.cpp/src/llama-grammar.h:71 | remove; needed for tests atm |
| TODO | llama.cpp/src/llama-grammar.h:133 | shared ptr |
| TODO | llama.cpp/src/llama-grammar.h:178 | move the API below as member functions of llama_grammar |
| TODO | llama.cpp/src/llama-graph.cpp:99 | use ubatch->n_seqs instead of failing |
| TODO | llama.cpp/src/llama-graph.cpp:404 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:453 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:456 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:474 | use ubatch->n_seqs instead of failing |
| TODO | llama.cpp/src/llama-graph.cpp:522 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:538 | Hybrid input classes are a bit redundant. |
| TODO | llama.cpp/src/llama-graph.cpp:626 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:635 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:1266 | Use scalar div instead when/if implemented |
| TODO | llama.cpp/src/llama-graph.cpp:1376 | move to hparams? |
| TODO | llama.cpp/src/llama-graph.cpp:1392 | add support for gated squared relu |
| TODO | llama.cpp/src/llama-graph.cpp:1612 | needs more work to be correct; for now just use the tensor shape |
| TODO | llama.cpp/src/llama-graph.cpp:1856 | if ubatch.equal_seqs() == true, we can split the three tensors below into ubatch.n_seqs_unq streams |
| TODO | llama.cpp/src/llama-graph.cpp:1927 | remove |
| TODO | llama.cpp/src/llama-graph.cpp:2183 | maybe separate the inner implementation into a separate function |
| TODO | llama.cpp/src/llama-graph.cpp:2588 | Call llama_sampler_accept_ggml after all samplers have been applied. |
| TODO | llama.cpp/src/llama-graph.h:58 | tmp - need something better to pass the data from the encoder to the decoder |
| TODO | llama.cpp/src/llama-graph.h:61 | this needs more work to be correct; for now copy the embeddings data to host memory |
| TODO | llama.cpp/src/llama-graph.h:738 | needed by build_attn_mha; figure out a way to remove? |
| TODO | llama.cpp/src/llama-graph.h:897 | remove |
| TODO | llama.cpp/src/llama-graph.h:951 | move this implementation to llama_memory_recurrent. |
| TODO | llama.cpp/src/llama-graph.h:1020 | better name |
| TODO | llama.cpp/src/llama-hparams.cpp:149 | maybe support convolution strides other than 1 |
| TODO | llama.cpp/src/llama-hparams.h:292 | think of a better place for this function |
| TODO | llama.cpp/src/llama-hparams.h:293 | pack the SWA params in a struct? |
| TODO | llama.cpp/src/llama-impl.h:64 | rename to llama_format? |
| TODO | llama.cpp/src/llama-kv-cache-iswa.cpp:206 | if we fail again, we should attempt different splitting strategies |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1086 | add a ggml helper function for this? |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1485 | support multiple streams |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1489 | use ubatch->n_seqs instead of failing |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1760 | we also need to save llama_kv_cell_ext when apply_ubatch() supports loading it |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1912 | we cannot yet restore llama_kv_cell_ext as apply_ubatch() does not support it yet |
| TODO | llama.cpp/src/llama-kv-cells.h:31 | add unit tests |
| TODO | llama.cpp/src/llama-memory-hybrid-iswa.cpp:76 | non-sequential equal split can be done if using unified KV cache |
| TODO | llama.cpp/src/llama-memory-hybrid-iswa.cpp:95 | will the recurrent cache be in an undefined context at this point? |
| TODO | llama.cpp/src/llama-memory-hybrid.cpp:76 | non-sequential equal split can be done if using unified KV cache |
| TODO | llama.cpp/src/llama-memory-hybrid.cpp:95 | will the recurrent cache be in an undefined context at this point? |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:390 | non-sequential equal split can be done if using unified KV cache |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:430 | optimize |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:482 | would it be possible to resize the cache instead? |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:623 | bake-in src refcounts in the cell metadata |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:931 | llama_memory_recurrent should have a notion of max sequences |
| TODO | llama.cpp/src/llama-memory-recurrent.h:15 | extract the cache state used for graph computation into llama_memory_recurrent_context_i |
| TODO | llama.cpp/src/llama-memory-recurrent.h:78 | optimize for recurrent state needs |
| TODO | llama.cpp/src/llama-memory-recurrent.h:178 | extract all the state like `head` and `n` here |
| TODO | llama.cpp/src/llama-mmap.cpp:43 | consider moving to llama-impl.h if needed in more places |
| TODO | llama.cpp/src/llama-model-loader.cpp:496 | this is not very clever - figure out something better |
| TODO | llama.cpp/src/llama-model-loader.cpp:659 | make optional |
| TODO | llama.cpp/src/llama-model-saver.cpp:202 | implement split file support |
| TODO | llama.cpp/src/llama-model-saver.cpp:247 | implement LoRA support |
| TODO | llama.cpp/src/llama-model.cpp:593 | Handle SWA metadata similarly when models start implementing it |
| TODO | llama.cpp/src/llama-model.cpp:853 | become a GGUF KV parameter |
| TODO | llama.cpp/src/llama-model.cpp:876 | become a GGUF KV parameter |
| TODO | llama.cpp/src/llama-model.cpp:995 | become a GGUF KV parameter |
| TODO | llama.cpp/src/llama-model.cpp:1200 | fix conversion scripts to correctly populate `n_swa` and `n_swa_pattern` |
| TODO | llama.cpp/src/llama-model.cpp:1515 | Jamba layers are a bit heterogeneous, so naming this is hard. |
| TODO | llama.cpp/src/llama-model.cpp:1815 | when MTP is implemented, this should probably be updated if needed |
| TODO | llama.cpp/src/llama-model.cpp:1883 | add variants |
| TODO | llama.cpp/src/llama-model.cpp:2157 | when MTP is implemented, this should probably be updated if needed |
| TODO | llama.cpp/src/llama-model.cpp:2488 | maybe add n_attn_temp_floor_scale as a separate KV? |
| TODO | llama.cpp/src/llama-model.cpp:2893 | move to a separate function |
| FIXME | llama.cpp/src/llama-model.cpp:7464 | workaround for the CPU backend buft having a NULL device |
| TODO | llama.cpp/src/llama-model.cpp:8572 | move reranking logic here and generalize |
| TODO | llama.cpp/src/llama-model.h:546 | move this to a new llm_arch_model_i interface |
| TODO | llama.cpp/src/llama-model.h:549 | move this to a new llm_arch_model_i interface |
| TODO | llama.cpp/src/llama-model.h:562 | remove |
| TODO | llama.cpp/src/llama-quant.cpp:181 | avoid hardcoded tensor names - use the TN_* constants |
| TODO | llama.cpp/src/llama-quant.cpp:313 | explore better strategies |
| TODO | llama.cpp/src/llama-quant.cpp:320 | explore better strategies |
| TODO | llama.cpp/src/llama-quant.cpp:589 | use LLM_KV |
| TODO | llama.cpp/src/llama-quant.cpp:590 | use LLM_KV |
| TODO | llama.cpp/src/llama-quant.cpp:654 | avoid hardcoded tensor names - use the TN_* constants |
| TODO | llama.cpp/src/llama-quant.cpp:867 | use a symmetric type instead |
| TODO | llama.cpp/src/llama-quant.cpp:985 | temporary sanity check that the F16 -> MXFP4 conversion is lossless |
| TODO | llama.cpp/src/llama-sampler.cpp:2548 | remove trigger_words support. |
| TODO | llama.cpp/src/llama-vocab.cpp:246 | there are a lot of common parts between spm and bpe tokenizers; they should be refactored and reused |
| TODO | llama.cpp/src/llama-vocab.cpp:730 | reduce string copies by using the cpts_offs array |
| TODO | llama.cpp/src/llama-vocab.cpp:1578 | should we set all of these to LLAMA_TOKEN_NULL? |
| TODO | llama.cpp/src/llama-vocab.cpp:2131 | remove; required until per-token attributes are available from the GGUF file |
| TODO | llama.cpp/src/llama-vocab.cpp:2230 | convert scripts should provide these tokens through the KV metadata LLM_KV_TOKENIZER_... |
| TODO | llama.cpp/src/llama-vocab.cpp:2497 | workaround for the o200k_harmony and solar-open tokenizers: the "<\|end\|>" token should not be EOG |
| TODO | llama.cpp/src/llama-vocab.cpp:2574 | Extract attributes from the GGUF file. |
| TODO | llama.cpp/src/llama-vocab.cpp:3271 | where do these characters come from? |
| FIXME | llama.cpp/src/models/bitnet.cpp:153 | do not use model.tok_embd directly, duplicate it as model.output |
| TODO | llama.cpp/src/models/chameleon.cpp:161 | this suppresses the output of image tokens, which is required to enable text-only outputs. |
| TODO | llama.cpp/src/models/gemma3.cpp:19 | is causal == true correct? might need some changes |
| TODO | llama.cpp/src/models/gemma3n-iswa.cpp:22 | is causal == true correct? might need some changes |
| TODO | llama.cpp/src/models/gemma3n-iswa.cpp:209 | move this to right after the last KV layer |
| TODO | llama.cpp/src/models/gemma3n-iswa.cpp:261 | verify if this is the correct behavior in the transformers implementation |
| TODO | llama.cpp/src/models/graph-context-mamba.cpp:131 | skip computing output earlier for unused tokens |
| TODO | llama.cpp/src/models/graph-context-mamba.cpp:244 | use semistructured matrices to implement state-space duality |
| TODO | llama.cpp/src/models/graph-context-mamba.cpp:260 | skip computing output earlier for unused tokens |
| TODO | llama.cpp/src/models/grovemoe.cpp:100 | Only do the expert selection and weights once |
| TODO | llama.cpp/src/models/kimi-linear.cpp:428 | can this ever be false? |
| TODO | llama.cpp/src/models/minicpm3.cpp:4 | if the model varies, these parameters need to be read from the model |
| TODO | llama.cpp/src/models/minicpm3.cpp:145 | is this correct? |
| TODO | llama.cpp/src/models/models.h:6 | remove in a follow-up PR - move to .cpp files |
| TODO | llama.cpp/src/unicode.h:7 | reimplement this structure in an endian-independent way |
| TODO | llama.cpp/tests/CMakeLists.txt:156 | disabled on loongarch64 because the ggml-ci node lacks Python 3.8 |
| TODO | llama.cpp/tests/CMakeLists.txt:171 | disabled due to slowness |
| TODO | llama.cpp/tests/CMakeLists.txt:232 | repair known memory leaks |
| TODO | llama.cpp/tests/test-backend-ops.cpp:2289 | Make a template or something |
| TODO | llama.cpp/tests/test-backend-ops.cpp:3132 | implement |
| TODO | llama.cpp/tests/test-backend-ops.cpp:4621 | add a test with a non-contiguous view as input; this case is needed for build_rope_2d in clip.cpp |
| TODO | llama.cpp/tests/test-backend-ops.cpp:6145 | this branch should become a separate test-case parameter instead of hardcoding this for these head shapes |
| TODO | llama.cpp/tests/test-backend-ops.cpp:6965 | implement for all backends |
| TODO | llama.cpp/tests/test-backend-ops.cpp:6977 | or "other" |
| TODO | llama.cpp/tests/test-backend-ops.cpp:6988 | implement for all backends |
| TODO | llama.cpp/tests/test-backend-ops.cpp:7486 | add after WebGPU is fixed |
| TODO | llama.cpp/tests/test-backend-ops.cpp:8908 | better value for n_threads |
| TODO | llama.cpp/tests/test-backend-sampler.cpp:734 | biasing too much here makes the Vulkan sampling fail - should be investigated further |
| TODO | llama.cpp/tests/test-chat-template.cpp:625 | llama_chat_format_single will be deprecated, remove these tests later |
| TODO | llama.cpp/tests/test-chat.cpp:121 | extract to a common helper (copied from test-grammar-integration.cpp) |
| TODO | llama.cpp/tests/test-grammar-integration.cpp:1414 | The following line should fail, but currently it passes. `exclusiveMinimum` is not supported, as it would likely be too difficult to implement. |
| TODO | llama.cpp/tests/test-grammar-integration.cpp:1421 | The following line should fail, but currently it passes. `uniqueItems` is not supported, as it would likely be too difficult to implement. |
| TODO | llama.cpp/tests/test-grammar-llguidance.cpp:1083 | The following line should fail, but currently it passes. `uniqueItems` is not supported, as it would likely be too difficult to implement. |
| TODO | llama.cpp/tests/test-grammar-parser.cpp:7 | should not include libllama sources |
| TODO | llama.cpp/tests/test-json-partial.cpp:153 | detect that the true/false/null literal was complete |
| FIXME | llama.cpp/tests/test-quantize-fns.cpp:63 | why is this done twice? |
| TODO | llama.cpp/tests/test-regex-partial.cpp:265 | ((?:b)?a*+).* ?? |
| TODO | llama.cpp/tools/cli/cli.cpp:68 | show progress |
| TODO | llama.cpp/tools/cli/cli.cpp:75 | reduce some copies here in the future |
| TODO | llama.cpp/tools/cli/cli.cpp:152 | support remote files in the future (http, https, etc.) |
| TODO | llama.cpp/tools/cli/cli.cpp:198 | maybe support it later? |
| TODO | llama.cpp/tools/cli/cli.cpp:212 | avoid using atexit() here by making `console` a singleton |
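
The cli.cpp:212 entry wants the atexit() hook replaced by making `console` a singleton. A minimal sketch using a Meyers singleton, whose destructor runs automatically at normal program exit; `console_sketch` is illustrative, not the actual common/console API:

```cpp
class console_sketch {
  public:
    // Constructed on first use; destroyed automatically at program exit.
    static console_sketch & instance() {
        static console_sketch inst;
        return inst;
    }

    console_sketch(const console_sketch &)             = delete;
    console_sketch & operator=(const console_sketch &) = delete;

  private:
    console_sketch()  { /* save and initialize terminal state */ }
    ~console_sketch() { /* restore terminal state - replaces the atexit() hook */ }
};
```
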
| TODO | llama.cpp/tools/completion/completion.cpp:916 | one inconvenience of the current chat template implementation is that we can't distinguish between user input and special tokens (prefix/postfix) |
| TODO | llama.cpp/tools/cvector-generator/cvector-generator.cpp:211 | get rid of malloc if possible |
| TODO | llama.cpp/tools/cvector-generator/cvector-generator.cpp:241 | get rid of this malloc if possible |
| TODO | llama.cpp/tools/cvector-generator/cvector-generator.cpp:287 | customize the padding token |
| TODO | llama.cpp/tools/cvector-generator/pca.hpp:72 | enable Metal support when support for GGML_OP_SQRT is added |
| TODO | llama.cpp/tools/cvector-generator/pca.hpp:139 | buf_size must be able to scale with params.n_batch |
| TODO | llama.cpp/tools/export-lora/export-lora.cpp:193 | remove this when we can support merging a subset of adapters. Ref: https://github.com/ggml-org/llama.cpp/pull/8607#discussion_r1686027777 |
| TODO | llama.cpp/tools/export-lora/export-lora.cpp:303 | add support for quantized lora |
| TODO | llama.cpp/tools/gguf-split/gguf-split.cpp:350 | detect the OS and use copy_file_range() here for better performance |
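
The gguf-split.cpp:350 entry suggests a copy_file_range() fast path. A minimal sketch of the Linux-only branch; on other platforms the existing buffered copy would remain, and error handling is reduced to the essentials:

```cpp
#if defined(__linux__)
#include <unistd.h>

// Kernel-side copy between file descriptors: no userspace bounce buffer.
static bool copy_range_fast(int fd_in, int fd_out, size_t len) {
    while (len > 0) {
        ssize_t n = copy_file_range(fd_in, nullptr, fd_out, nullptr, len, 0);
        if (n <= 0) {
            return false; // caller falls back to the portable read/write loop
        }
        len -= (size_t) n;
    }
    return true;
}
#endif
```
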
| TODO |
llama.cpp/tools/imatrix/imatrix.cpp:678 |
extract into its own method; this is also used by the GGUF-based format |
| TODO |
llama.cpp/tools/imatrix/imatrix.cpp:814 |
extract into its own method; this is also used by the legacy format |
| TODO |
llama.cpp/tools/imatrix/imatrix.cpp:1006 |
only get outputs when (params.process_output || params.compute_ppl) |
| TODO |
llama.cpp/tools/mtmd/clip-graph.h:100 |
there was a more efficient which relies on ggml_view and ggml_rope_ext_inplace, but the rope inplace does not work well with non-contiguous tensors ; we should fix that and revert back to the original implementation in https://github.com/ggml-org/llama.cpp/pull/13065 |
| TODO |
llama.cpp/tools/mtmd/clip-impl.h:204 |
improve this later |
| TODO |
llama.cpp/tools/mtmd/clip-model.h:99 |
support warmup size for custom token numbers |
| TODO |
llama.cpp/tools/mtmd/clip-model.h:239 |
rename it to fc (fully connected layer) |
| TODO |
llama.cpp/tools/mtmd/clip.cpp:345 |
q/k norm requires row size == n_embd, while here it's d_head |
| TODO | llama.cpp/tools/mtmd/clip.cpp:646 | there was a more efficient implementation which relies on ggml_view and ggml_rope_ext_inplace, but the rope inplace does not work well with non-contiguous tensors; we should fix that and revert back to the original implementation in https://github.com/ggml-org/llama.cpp/pull/13065 |
| TODO | llama.cpp/tools/mtmd/clip.cpp:1131 | verify the image_min_tokens |
| TODO | llama.cpp/tools/mtmd/clip.cpp:1142 | check kimivl preprocessor for exact values |
| TODO | llama.cpp/tools/mtmd/clip.cpp:1464 | this is a hack to support Yi-type llava |
| TODO | llama.cpp/tools/mtmd/clip.cpp:2143 | we don't support audio for Gemma 3N, but the GGUF contains audio tensors |
| TODO | llama.cpp/tools/mtmd/clip.cpp:2269 | define the behavior for add_padding = false |
| TODO | llama.cpp/tools/mtmd/clip.cpp:2631 | this is only used by minicpmv, maybe remove it |
| TODO | llama.cpp/tools/mtmd/clip.cpp:3994 | remove this function |
| TODO | llama.cpp/tools/mtmd/clip.cpp:4002 | remove this function |
| TODO | llama.cpp/tools/mtmd/clip.h:61 | should be an enum, not a string |
| TODO | llama.cpp/tools/mtmd/mtmd-audio.cpp:381 | handle short audio differently or return an error |
| TODO | llama.cpp/tools/mtmd/mtmd-audio.cpp:400 | probably unnecessary here? (or better done in g_cache?) |
| TODO | llama.cpp/tools/mtmd/mtmd-audio.cpp:412 | handle these checks better |
| TODO | llama.cpp/tools/mtmd/mtmd-audio.cpp:520 | maybe handle this better |
| TODO | llama.cpp/tools/mtmd/mtmd-cli.cpp:84 | support --system-prompt with the /clear command |
| TODO | llama.cpp/tools/mtmd/mtmd.cpp:702 | maybe support batching, but this may come with a memory cost |
| TODO | llama.cpp/tools/mtmd/mtmd.h:187 | deprecate |
| TODO | llama.cpp/tools/mtmd/mtmd.h:190 | deprecate |
| TODO | llama.cpp/tools/mtmd/mtmd.h:192 | deprecate |
| TODO | llama.cpp/tools/mtmd/mtmd.h:217 | deprecate |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:869 | this could be made smaller; it's currently the worst-case size |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:905 | don't evaluate the last token of each sequence |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:1145 | the last token of each sequence doesn't need to be evaluated |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:1167 | this could be made smaller; it's currently the worst-case size |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:1199 | end before the last token; no need to predict past the end of the sequences |
| FIXME | llama.cpp/tools/perplexity/perplexity.cpp:1244 | this uses the wrong first logits when not skipping the choice word |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:1575 | don't evaluate the last token of each sequence |
| TODO | llama.cpp/tools/quantize/quantize.cpp:75 | share with imatrix.cpp |
| TODO | llama.cpp/tools/quantize/quantize.cpp:587 | list multiple datasets when there is more than one |
| TODO | llama.cpp/tools/server/server-common.cpp:157 | use base64::decode from base64.hpp |
| TODO | llama.cpp/tools/server/server-common.cpp:939 | add audio_url support by reusing handle_media() |
| TODO | llama.cpp/tools/server/server-common.cpp:1003 | test this properly |
| TODO | llama.cpp/tools/server/server-common.cpp:1049 | The response format of this option is not yet OAI-compatible, but it seems like no one is really using it; we may need to fix it in the future |
| TODO | llama.cpp/tools/server/server-common.cpp:1705 | reuse llama_detokenize |
| TODO | llama.cpp/tools/server/server-common.cpp:1847 | optimize this block by reducing memory allocations and movement |
| TODO | llama.cpp/tools/server/server-common.cpp:1868 | make the project name an input |
| TODO | llama.cpp/tools/server/server-common.cpp:1897 | current filename |
| TODO | llama.cpp/tools/server/server-common.cpp:1904 | configurable? |
| TODO | llama.cpp/tools/server/server-common.h:152 | server_tokens should be copyable - remove this |
| TODO | llama.cpp/tools/server/server-common.h:303 | move it to server-task.cpp |
| TODO | llama.cpp/tools/server/server-common.h:310 | move it to server-task.cpp |
| TODO | llama.cpp/tools/server/server-common.h:346 | move these to server-task.cpp |
| TODO | llama.cpp/tools/server/server-context.cpp:51 | change to unique_ptrs for consistency |
| TODO | llama.cpp/tools/server/server-context.cpp:59 | move members that belong to the task (such as `generated_text`, `has_new_line`) to task_results_state |
| TODO | llama.cpp/tools/server/server-context.cpp:997 | mtmd does not support prompt cache |
| TODO | llama.cpp/tools/server/server-context.cpp:1021 | improve logic |
| TODO | llama.cpp/tools/server/server-context.cpp:1087 | This will error out if a user requests two aloras, but only |
| TODO | llama.cpp/tools/server/server-context.cpp:1154 | speculative decoding requires multiple samples per batch - not supported yet |
| TODO | llama.cpp/tools/server/server-context.cpp:1157 | getting post/pre sampling logits is not yet supported with backend sampling |
| TODO | llama.cpp/tools/server/server-context.cpp:1160 | tmp until backend sampling is fully implemented |
| TODO | llama.cpp/tools/server/server-context.cpp:1256 | improve by not doing it more than once for each new line |
| TODO | llama.cpp/tools/server/server-context.cpp:1339 | optimize this with min-p optimization |
| TODO | llama.cpp/tools/server/server-context.cpp:1957 | simplify and improve |
| TODO | llama.cpp/tools/server/server-context.cpp:2042 | rework to have a single draft llama_context shared across all slots [TAG_SERVER_SPEC_REWORK] |
| TODO | llama.cpp/tools/server/server-context.cpp:2127 | maybe move the branch outside of this loop in the future |
| TODO | llama.cpp/tools/server/server-context.cpp:2164 | support memory-less logits computation |
| TODO | llama.cpp/tools/server/server-context.cpp:2337 | support can be added in the future when corresponding vision models get released |
| TODO | llama.cpp/tools/server/server-context.cpp:2476 | try to make this conditional on the context or the memory module, instead of the model type |
| TODO | llama.cpp/tools/server/server-context.cpp:2627 | try to terminate only the largest active slot/sequence and continue with the rest |
| TODO | llama.cpp/tools/server/server-context.cpp:2637 | update slot state based on llama_memory_seq_pos_min() and llama_memory_seq_pos_max() |
| TODO | llama.cpp/tools/server/server-context.cpp:2641 | handle ret == 2 (abort) when we start aborting |
| TODO | llama.cpp/tools/server/server-context.cpp:2768 | set it here instead of doing it inside populate_token_probs |
| TODO | llama.cpp/tools/server/server-context.cpp:2826 | set result.probs |
| TODO | llama.cpp/tools/server/server-context.cpp:2963 | this log can become very long; put it behind a flag or think about a more compact format |
| TODO | llama.cpp/tools/server/server-context.cpp:2977 | this is inaccurate due to child tasks |
| TODO | llama.cpp/tools/server/server-context.cpp:3223 | get rid of this dynamic_cast |
| TODO | llama.cpp/tools/server/server-context.cpp:3328 | get rid of this dynamic_cast |
| TODO | llama.cpp/tools/server/server-context.cpp:3531 | this could maybe be multimodal |
| TODO | llama.cpp/tools/server/server-http.cpp:357 | maybe handle unsuccessful sink.write? For now, we rely on is_connection_closed() |
| TODO | llama.cpp/tools/server/server-http.h:23 | move this to a virtual function once we have proper polymorphism support |
| TODO | llama.cpp/tools/server/server-models.cpp:7 | remove this once we use the HTTP client from download.h |
| TODO | llama.cpp/tools/server/server-models.cpp:153 | maybe validate the preset before rendering? |
| TODO | llama.cpp/tools/server/server-models.cpp:196 | allow refreshing the cached model list |
| TODO | llama.cpp/tools/server/server-models.cpp:800 | add support for this in the web UI |
| TODO | llama.cpp/tools/server/server-models.cpp:886 | add other fields; may require reading GGUF metadata |
| TODO | llama.cpp/tools/server/server-models.h:24 | also add a downloading state when the logic is added |
| TODO | llama.cpp/tools/server/server-task.cpp:65 | deduplicate? |
| TODO | llama.cpp/tools/server/server-task.cpp:123 | deduplicate? |
| TODO | llama.cpp/tools/server/server-task.cpp:213 | implement |
| TODO | llama.cpp/tools/server/server-task.cpp:279 | add more sanity checks for the input parameters |
| TODO | llama.cpp/tools/server/server-task.cpp:413 | we may want to throw errors here, in case "el" is incorrect |
| TODO | llama.cpp/tools/server/server-task.cpp:1902 | for some reason we can't copy server_tokens, so we have to do this workaround |
| TODO | llama.cpp/tools/server/server-task.h:11 | prevent including the whole server-common.h, as we only use server_tokens |
| TODO | llama.cpp/tools/server/server-task.h:31 | change this to a more generic "response_format" to replace the "format_response_*" in server-common |
| TODO | llama.cpp/tools/server/server-task.h:63 | implement |
| TODO | llama.cpp/tools/server/server-task.h:500 | somehow reuse server_metrics in the future, instead of duplicating the fields |
| TODO | llama.cpp/tools/server/server.cpp:268 | refactor into common/console |
| TODO | llama.cpp/tools/server/tests/unit/test_chat_completion.py:254 | should not be a valid case |
| TODO | llama.cpp/tools/server/tests/unit/test_completion.py:163 | remove this once test_cache_vs_nocache_prompt is fixed |
| TODO | llama.cpp/tools/server/tests/unit/test_completion.py:181 | remove this once test_cache_vs_nocache_prompt is fixed |
| TODO | llama.cpp/tools/server/tests/unit/test_completion.py:201 | remove this once test_cache_vs_nocache_prompt is fixed |
| FIXME | llama.cpp/tools/server/tests/unit/test_completion.py:369 | the result is not deterministic when using a slot other than slot 0 |
| TODO | llama.cpp/tools/server/tests/unit/test_lora.py:59 | remove this once test_cache_vs_nocache_prompt is fixed |
| TODO | llama.cpp/tools/server/tests/unit/test_lora.py:82 | find & add other lora adapters for this model |
| TODO | llama.cpp/tools/server/tests/unit/test_lora.py:108 | remove this once test_cache_vs_nocache_prompt is fixed |
| TODO | llama.cpp/tools/server/tests/unit/test_tool_call.py:422 | fix these (wrong results: either didn't respect the decimal instruction or got the wrong value) |
| TODO | llama.cpp/tools/server/webui/src/lib/stores/models.svelte.ts:458 | Remove this polling once llama-server properly waits for the operation |
| TODO | llama.cpp/tools/tokenize/tokenize.cpp:2 | start using log.h |
| TODO | llama.cpp/tools/tokenize/tokenize.cpp:10 | remove me |
| TODO | llama.cpp/tools/tokenize/tokenize.cpp:78 | potential opportunity to roll common stuff into common/console.cpp |
| TODO | llama.cpp/tools/tokenize/tokenize.cpp:180 | reporting invalid_utf8 would be useful on non-Windows too |
| TODO | llama.cpp/tools/tts/convert_pt_to_hf.py:4 | this script is LLM-generated, probably very inefficient, and should be rewritten |
| TODO | llama.cpp/tools/tts/tts-outetts.py:148 | load from JSON |
| TODO | llama.cpp/tools/tts/tts-outetts.py:181 | tokenization is slow for some reason - here is pre-tokenized input |
| TODO | llama.cpp/tools/tts/tts.cpp:200 | not optimized at all |
| TODO | llama.cpp/tools/tts/tts.cpp:273 | can be done once |
| TODO | llama.cpp/tools/tts/tts.cpp:1022 | all logits? |
| TODO | nonstd.h:76 | `%s\n", __FILE__, __LINE__, message); \` |
| TODO | termbox2.h:2416 | Assert global.back.(width,height) == global.front.(width,height) |
| TODO | termbox2.h:2540 | iswprint ch? |
| TODO | termbox2.h:2662 | \r, \t, \v, \f, etc? |
| TODO | termbox2.h:2948 | Reorder TB_CAP_* so more critical caps come first. |
| TODO | termbox2.h:3497 | Harden against errors encountered mid-resize |
| TODO | termbox2.h:4048 | iswprint ch? |