| TODO | llama.cpp/.github/workflows/build.yml:1041 | disabled for now, consider adding tests for all CPU variants instead |
| TODO | llama.cpp/.github/workflows/build.yml:1079 | Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project |
| TODO | llama.cpp/.github/workflows/build.yml:1124 | Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project |
| TODO | llama.cpp/.github/workflows/build.yml:1168 | add SSL support; we will also need to modify win-build-sycl.bat to accept user-specified args |
| FIXME | llama.cpp/.github/workflows/build.yml:1392 | test on devices |
| TODO | llama.cpp/.github/workflows/build.yml:1461 | simplify the following workflows using a matrix |
| TODO | llama.cpp/.github/workflows/build.yml:1462 | run lighter CI on PRs and the full CI only on master (if needed) |
| TODO | llama.cpp/.github/workflows/release.yml:400 | Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project |
| TODO | llama.cpp/CMakeLists.txt:42 | analyze performance impact, see https://spidermonkey.dev/blog/2025/01/15/is-memory64-actually-worth-using |
| TODO | llama.cpp/CONTRIBUTING.md:149 | abbreviations usage |
| TODO | llama.cpp/CONTRIBUTING.md:153 | add guidelines with examples and apply them to the codebase |
| TODO | llama.cpp/ci/run.sh:55 | Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project |
| TODO | llama.cpp/ci/run.sh:334 | this hangs for some reason ... |
| TODO | llama.cpp/common/CMakeLists.txt:113 | use list(APPEND LLAMA_COMMON_EXTRA_LIBS ...) |
| TODO | llama.cpp/common/arg.cpp:178 | detect this based on the current console |
| TODO | llama.cpp/common/arg.cpp:667 | maybe convert enum llama_example to string |
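
The arg.cpp:667 entry asks for an enum-to-string conversion. A minimal sketch of what that could look like with a switch-based mapping; the enumerator names below are illustrative stand-ins, not the actual `llama_example` list from common.h:

```cpp
#include <string>

// Illustrative stand-in for enum llama_example (the real enumerators live in common.h).
enum llama_example_sketch {
    EXAMPLE_COMMON,
    EXAMPLE_MAIN,
    EXAMPLE_SERVER,
};

// One switch keeps the mapping exhaustive and easy to audit.
static std::string example_to_string(llama_example_sketch ex) {
    switch (ex) {
        case EXAMPLE_COMMON: return "common";
        case EXAMPLE_MAIN:   return "main";
        case EXAMPLE_SERVER: return "server";
    }
    return "unknown";
}
```
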
| TODO | llama.cpp/common/arg.cpp:867 | support arg with 2 values |
| TODO | llama.cpp/common/chat-parser-xml-toolcall.cpp:506 | Delete this when json_partial adds top-level support for null/true/false |
| TODO | llama.cpp/common/chat-parser-xml-toolcall.cpp:652 | Note that form.allow_toolcall_in_think is not tested yet. If anyone confirms it works, this comment can be removed. |
| TODO | llama.cpp/common/chat-parser.cpp:1401 | Tool calling |
| TODO | llama.cpp/common/chat-parser.h:22 | rename to params |
| TODO | llama.cpp/common/chat.cpp:133 | these can become expensive for long messages - how to optimize? |
| TODO | llama.cpp/common/chat.cpp:200 | this is ugly, refactor it somehow |
| TODO | llama.cpp/common/chat.cpp:812 | do we need to merge, or is replacing fine? |
| TODO | llama.cpp/common/chat.cpp:818 | merge properly instead of overwriting (matching old behavior) |
| TODO | llama.cpp/common/chat.cpp:836 | improve this later |
| TODO | llama.cpp/common/chat.cpp:2401 | if (has_raw_python) |
| TODO | llama.cpp/common/chat.cpp:3228 | support that mix in the handlers below. |
| TODO | llama.cpp/common/chat.h:155 | refactor this to "bool enable_thinking" |
| TODO | llama.cpp/common/chat.h:179 | refactor this to "bool parse_reasoning" |
| TODO | llama.cpp/common/common.cpp:101 | windows + arm64 + mingw64 |
| TODO | llama.cpp/common/common.cpp:381 | windows + arm64 + mingw64 |
| TODO | llama.cpp/common/common.cpp:997 | move to common/sampling |
| TODO | llama.cpp/common/common.cpp:1117 | fix naming |
| TODO | llama.cpp/common/common.h:521 | support threadpool |
| TODO | llama.cpp/common/common.h:829 | replace embd_norm with an enum |
| TODO | llama.cpp/common/console.cpp:1013 | maybe support multiline history entries? |
| TODO | llama.cpp/common/download.cpp:360 | maybe retry only on certain codes |
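
The download.cpp:360 entry suggests retrying only on certain HTTP codes. A minimal sketch of such a policy with exponential backoff; `perform_request` is a hypothetical stand-in for the libcurl call at that site, not the actual helper:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Retry only on rate limiting and transient server errors.
static bool is_retryable(long http_status) {
    return http_status == 408 || http_status == 429 ||
           (http_status >= 500 && http_status < 600);
}

// perform_request performs one attempt and returns its HTTP status.
static long download_with_retries(const std::function<long()> & perform_request, int max_attempts) {
    long status = 0;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        status = perform_request();
        if (!is_retryable(status)) {
            break; // success, or a permanent error such as 404 - do not retry
        }
        // exponential backoff between attempts
        std::this_thread::sleep_for(std::chrono::milliseconds(250 << attempt));
    }
    return status;
}
```
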
| TODO | llama.cpp/common/download.cpp:436 | use the actual GET status? |
| TODO | llama.cpp/common/download.cpp:747 | cache the manifest response so that it appears in the model list |
| TODO | llama.cpp/common/download.cpp:848 | get the GGUF size, not the manifest size |
| TODO | llama.cpp/common/jinja/lexer.cpp:213 | handle lstrip/rstrip for comments? (not important for now) |
| FIXME | llama.cpp/common/jinja/parser.cpp:424 | tests can also be expressed like this: if x is eq 3 |
| TODO | llama.cpp/common/jinja/runtime.h:585 | probably allow printing value_none as the string "None"? currently this breaks some templates |
| TODO | llama.cpp/common/jinja/value.cpp:289 | make sure this is the same behavior as Python's strftime |
| FIXME | llama.cpp/common/jinja/value.cpp:575 | Support an unspecified delimiter (split on consecutive whitespace, with no leading or trailing whitespace) |
| FIXME | llama.cpp/common/jinja/value.cpp:599 | Support an unspecified delimiter (split on consecutive whitespace, with no leading or trailing whitespace) |
| FIXME | llama.cpp/common/jinja/value.cpp:916 | sorting is currently always case-sensitive |
| FIXME | llama.cpp/common/jinja/value.cpp:1027 | sorting is currently always case-sensitive |
| TODO | llama.cpp/common/jinja/value.cpp:1166 | not sure if this is the right behavior |
| TODO | llama.cpp/common/jinja/value.cpp:1220 | avoid circular references |
| TODO | llama.cpp/common/jinja/value.cpp:1307 | avoid circular references |
| TODO | llama.cpp/common/jinja/value.h:156 | C++20 <=> operator |
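
The value.h:156 entry points at the C++20 three-way comparison operator. A minimal sketch of the idiom on an illustrative type, not the actual jinja value class:

```cpp
#include <compare>
#include <cstdint>
#include <string>

struct value_sketch {
    int64_t     i;
    std::string s;

    // One defaulted operator<=> replaces hand-written <, <=, > and >=
    // (and implicitly provides a defaulted operator== as well).
    auto operator<=>(const value_sketch &) const = default;
};

// usage: value_sketch{1, "a"} < value_sketch{1, "b"} evaluates to true
```
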
| TODO | llama.cpp/common/json-partial.cpp:311 | handle more unclosed top-level primitives if the stack was empty but we got an error (e.g. "tru", "\"", etc.) |
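
The json-partial.cpp:311 entry wants unclosed top-level primitives such as "tru" treated as healable input rather than a hard error. A minimal sketch of the prefix test, assuming complete literals already parse normally; an unterminated string ("\"") would need a similar check:

```cpp
#include <string_view>

// True when s is a strict, non-empty prefix of one of the JSON literals,
// e.g. "tru" for "true" - a candidate for healing rather than a parse error.
static bool is_partial_json_literal(std::string_view s) {
    for (std::string_view lit : {std::string_view("true"),
                                 std::string_view("false"),
                                 std::string_view("null")}) {
        if (!s.empty() && s.size() < lit.size() && lit.substr(0, s.size()) == s) {
            return true;
        }
    }
    return false;
}
```
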
| TODO | llama.cpp/common/json-partial.h:3 | use json_fwd.hpp when possible |
| TODO | llama.cpp/common/json-schema-to-grammar.cpp:971 | support minimum, maximum, exclusiveMinimum, exclusiveMaximum at least for zero |
| TODO | llama.cpp/common/peg-parser.cpp:1357 | Implement more comprehensive grammar generation for raw strings. |
| TODO | llama.cpp/common/preset.cpp:85 | maybe throw an error instead? |
| TODO | llama.cpp/common/preset.h:31 | maybe implement to_env() if needed |
| TODO | llama.cpp/common/sampling.cpp:12 | deduplicate with llama-impl.h |
| TODO | llama.cpp/common/sampling.cpp:397 | measure grammar performance |
| TODO | llama.cpp/common/sampling.cpp:471 | simplify |
| TODO | llama.cpp/common/sampling.cpp:617 | compute this from the vocab |
| TODO | llama.cpp/common/sampling.h:32 | measure grammar performance |
| TODO | llama.cpp/common/speculative.cpp:125 | track performance of the most recent calls |
| TODO | llama.cpp/common/speculative.cpp:171 | optimize or pass from outside? |
| TODO | llama.cpp/common/speculative.cpp:452 | implement |
| TODO | llama.cpp/common/speculative.cpp:735 | noop |
| TODO | llama.cpp/convert_hf_to_gguf.py:555 | why do we squeeze here? |
| TODO | llama.cpp/convert_hf_to_gguf.py:614 | use Q4_K and Q6_K |
| TODO | llama.cpp/convert_hf_to_gguf.py:854 | Handle "sliding_attention" similarly when models start implementing it |
| TODO | llama.cpp/convert_hf_to_gguf.py:966 | should these be marked as UNUSED instead? (maybe not) |
| TODO | llama.cpp/convert_hf_to_gguf.py:2357 | how to determine special FIM tokens automatically? |
| TODO | llama.cpp/convert_hf_to_gguf.py:2971 | remove this once everyone has migrated to a newer version of llama.cpp |
| TODO | llama.cpp/convert_hf_to_gguf.py:3184 | multiply by the scale directly instead of inverting it twice |
| TODO | llama.cpp/convert_hf_to_gguf.py:5481 | this is a hack and should be fixed |
| TODO | llama.cpp/convert_hf_to_gguf.py:6071 | these special tokens should be exported only for the CodeGemma family |
| TODO | llama.cpp/convert_hf_to_gguf.py:6575 | implement self.prediction_coefs.weight.clamp_(...) |
| TODO | llama.cpp/convert_hf_to_gguf.py:7073 | does this really matter? |
| TODO | llama.cpp/convert_hf_to_gguf.py:7997 | MiMo v2 does not indicate the number of next-token-prediction layers, so we cannot do it the same way as GLM4_MOE |
| TODO | llama.cpp/convert_hf_to_gguf.py:9315 | Extend this if the prefix(es) need to be configurable |
| TODO | llama.cpp/convert_hf_to_gguf.py:9892 | remove this once image support is implemented for Chameleon |
| TODO | llama.cpp/convert_hf_to_gguf.py:10471 | remove once MXFP4 is supported more generally |
| TODO | llama.cpp/convert_hf_to_gguf.py:10941 | remove this once everyone migrates to a newer version of llama.cpp |
| TODO | llama.cpp/convert_hf_to_gguf.py:11526 | uncomment U64, U32, and U16, ref: https://github.com/pytorch/pytorch/issues/58734 |
| TODO | llama.cpp/convert_hf_to_gguf_update.py:55 | generate tokenizer tests for llama.cpp |
| TODO | llama.cpp/convert_hf_to_gguf_update.py:81 | this string has to exercise as much pre-tokenizer functionality as possible |
| TODO | llama.cpp/convert_hf_to_gguf_update.py:85 | add models here, base models preferred |
| TODO | llama.cpp/convert_lora_to_gguf.py:64 | add ellipsis in the type signature |
| TODO | llama.cpp/convert_lora_to_gguf.py:99 | make sure this is correct |
| TODO | llama.cpp/convert_lora_to_gguf.py:167 | support higher-dimensional A shapes bigger than 1 |
| TODO | llama.cpp/convert_lora_to_gguf.py:173 | compose the above two |
| TODO | llama.cpp/examples/convert_legacy_llama.py:133 | match this with `llama_ftype` |
| TODO | llama.cpp/examples/convert_legacy_llama.py:134 | rename to LLAMAFileType |
| TODO | llama.cpp/examples/convert_legacy_llama.py:135 | move to `gguf.py` |
| TODO | llama.cpp/examples/convert_legacy_llama.py:209 | verify this |
| TODO | llama.cpp/examples/convert_legacy_llama.py:351 | reuse (probably move to gguf.py?) |
| FIXME | llama.cpp/examples/convert_legacy_llama.py:1266 | Respect --vocab-dir? |
| TODO | llama.cpp/examples/gguf-hash/deps/xxhash/xxhash.h:798 | Update to the correct value when it's been specified. |
| TODO | llama.cpp/examples/gguf-hash/deps/xxhash/xxhash.h:3920 | IBM XL |
| FIXME | llama.cpp/examples/gguf-hash/deps/xxhash/xxhash.h:4670 | Clang's output is still _much_ faster -- on an AMD Ryzen 3600, |
| TODO | llama.cpp/examples/json_schema_to_grammar.py:218 | support "uri", "email" string formats |
| TODO | llama.cpp/examples/json_schema_to_grammar.py:694 | support minimum, maximum, exclusiveMinimum, exclusiveMaximum at least for zero |
| TODO | llama.cpp/examples/parallel/parallel.cpp:507 | print sampling/grammar timings for all clients |
| TODO | llama.cpp/examples/pydantic_models_to_grammar.py:20 | fix this |
| TODO | llama.cpp/examples/retrieval/retrieval.cpp:8 | remove me |
| TODO | llama.cpp/examples/speculative-simple/speculative-simple.cpp:51 | simplify this logic |
| TODO | llama.cpp/examples/speculative/speculative.cpp:423 | simplify |
| TODO | llama.cpp/examples/speculative/speculative.cpp:629 | print sampling/grammar timings for all drafts |
| TODO | llama.cpp/ggml/CMakeLists.txt:90 | mark all options as advanced when not GGML_STANDALONE |
| TODO | llama.cpp/ggml/include/ggml-metal.h:42 | remove in the future |
| TODO | llama.cpp/ggml/include/ggml.h:190 | support for clang |
| TODO | llama.cpp/ggml/include/ggml.h:249 | convert to enum, see https://github.com/ggml-org/llama.cpp/pull/16187#discussion_r2388538726 |
| TODO | llama.cpp/ggml/include/ggml.h:749 | temporary until model loading of ggml examples is refactored |
| TODO | llama.cpp/ggml/include/ggml.h:1550 | when we start computing the gradient, make a copy instead of a view |
| TODO | llama.cpp/ggml/include/ggml.h:1557 | when we start computing the gradient, make a copy instead of a view |
| TODO | llama.cpp/ggml/include/ggml.h:1570 | when we start computing the gradient, make a copy instead of a view |
| TODO | llama.cpp/ggml/include/ggml.h:1955 | this is very likely wrong for some cases - needs more testing |
| TODO | llama.cpp/ggml/include/ggml.h:2346 | needs to be adapted to ggml_flash_attn_ext |
| TODO | llama.cpp/ggml/include/ggml.h:2459 | currently only the lower, right, non-unitriangular variant is implemented |
| TODO | llama.cpp/ggml/include/ggml.h:2723 | currently, only a few functions are in the base ggml API, while the rest are in the CPU backend |
| TODO | llama.cpp/ggml/src/CMakeLists.txt:78 | should not be set globally |
| TODO | llama.cpp/ggml/src/CMakeLists.txt:103 | these flags probably need to be tweaked on some architectures |
| TODO | llama.cpp/ggml/src/ggml-alloc.c:738 | better way to add external dependencies |
| FIXME | llama.cpp/ggml/src/ggml-backend-reg.cpp:163 | backends cannot be safely unloaded without a function to destroy all the backend resources, |
| FIXME | llama.cpp/ggml/src/ggml-backend.cpp:182 | add a generic callback to the buffer interface |
| FIXME | llama.cpp/ggml/src/ggml-backend.cpp:1199 | count the number of inputs instead of only checking when full |
| TODO | llama.cpp/ggml/src/ggml-backend.cpp:1567 | add a public function to facilitate this, since applications do not have direct access to the backend interface |
| TODO | llama.cpp/ggml/src/ggml-backend.cpp:1609 | pass the backend to the callback; then the user can decide if they want to synchronize |
| FIXME | llama.cpp/ggml/src/ggml-backend.cpp:1658 | needs to be size*2 to account for leafs (do it in graph_split instead) |
| TODO | llama.cpp/ggml/src/ggml-blas/ggml-blas.cpp:411 | find the optimal value |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:1073 | performance is low. |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2264 | check theta_scale_length and position_length. |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2341 | acl_yarn_ramp_tensor should use the rope cache. |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2812 | n_dims < ne0 |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2839 | ne0 != n_dims in mode2 |
| TODO | llama.cpp/ggml/src/ggml-cann/aclnn_ops.h:883 | If `ne12 > 1`, grouped multiplication and memory copying is used for efficiency. |
| TODO | llama.cpp/ggml/src/ggml-cann/common.h:619 | each stream should have a memory pool. |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:173 | add more device info later. |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1104 | the CANN backend doesn't support quantized types yet; just leave the code |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1208 | need to handle tensors that have padding. |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1229 | refer to CANN (#6017); it uses the thread's default stream. |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1311 | Support 310p P2P copy |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1438 | quantized type? |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2016 | Support 310p P2P copy |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2040 | this event is not effective with ACL graph mode; change to use aclrtSynchronizeStream |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2096 | support broadcast for ADD + RMS_NORM |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2205 | Optimize here. Currently, we can only |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2354 | support GGML_TYPE_BF16 |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2369 | Support rope_dim < ne00(dim) |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2441 | add circular padding support for cann, see https://github.com/ggml-org/llama.cpp/pull/16985 |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2474 | support bias != 0.0f |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2476 | support attention sinks [TAG_ATTN_SINKS] |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2498 | support attention sinks [TAG_ATTN_SINKS] |
| TODO | llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2507 | padding to support |
| TODO | llama.cpp/ggml/src/ggml-common.h:1087 | fix the name to kvalues_iq4_nl |
| TODO | llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt:503 | Separation to determine activation of VX/VXE/VXE2 |
| TODO | llama.cpp/ggml/src/ggml-cpu/amx/amx.cpp:152 | not sure if correct (https://github.com/ggml-org/llama.cpp/pull/16315) |
| TODO | llama.cpp/ggml/src/ggml-cpu/amx/common.h:83 | fix padding for the VNNI format |
| TODO | llama.cpp/ggml/src/ggml-cpu/amx/mmq.cpp:510 | this is the reference impl! |
| TODO | llama.cpp/ggml/src/ggml-cpu/amx/mmq.cpp:2426 | performance improvement: merge quant A |
| TODO | llama.cpp/ggml/src/ggml-cpu/arch/wasm/quants.c:382 | check if unrolling this is better |
| TODO | llama.cpp/ggml/src/ggml-cpu/arch/wasm/quants.c:475 | check if unrolling this is better |
| FIXME | llama.cpp/ggml/src/ggml-cpu/arch/x86/cpu-feats.cpp:264 | this does not check for OS support |
| TODO | llama.cpp/ggml/src/ggml-cpu/arch/x86/quants.c:1110 | can _mm256_mulhi_epu16 be faster even if 16-bit? |
| TODO | llama.cpp/ggml/src/ggml-cpu/binary-ops.cpp:114 | Use the 'traits' lookup table (for type conversion fns) instead of a mass of 'if' conditions with long templates |
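
The binary-ops.cpp:114 entry (and the matching unary-ops.cpp:135 one further down) asks to replace long if-chains with a type-indexed lookup table. A minimal sketch of the idea using illustrative names, not the real ggml traits structures:

```cpp
#include <cstdint>

typedef float (*to_f32_fn)(const void * src, int64_t i);

static float f32_to_f32(const void * src, int64_t i) {
    return ((const float *) src)[i];
}
static float i32_to_f32(const void * src, int64_t i) {
    return (float) ((const int32_t *) src)[i];
}

enum sketch_type { SKETCH_F32, SKETCH_I32, SKETCH_TYPE_COUNT };

// One table lookup replaces a mass of 'if (type == ...)' branches.
static const to_f32_fn sketch_traits[SKETCH_TYPE_COUNT] = {
    /* SKETCH_F32 */ f32_to_f32,
    /* SKETCH_I32 */ i32_to_f32,
};

static float load_as_f32(sketch_type t, const void * src, int64_t i) {
    return sketch_traits[t](src, i);
}
```
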
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:170 | double-check these work correctly |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:521 | move to ggml-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:123 | add support for explicit memory order |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:130 | add support for explicit memory order |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:137 | add support for explicit memory order |
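
The three ggml-cpu.c entries above ask for explicit memory orders on the atomic helpers. A minimal sketch of the distinction, shown with C++ std::atomic for brevity (ggml-cpu.c itself is C and wraps stdatomic/Interlocked primitives):

```cpp
#include <atomic>

static std::atomic<int> g_counter{0};

// A plain statistics counter needs no ordering guarantees.
static void counter_inc_relaxed() {
    g_counter.fetch_add(1, std::memory_order_relaxed);
}

// Acquire pairs with a release store when the atomic publishes other data.
static int counter_load_acquire() {
    return g_counter.load(std::memory_order_acquire);
}
```
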
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1206 | this is a bit of a hack; we should probably have a better way to handle this |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1263 | extract to "extra_op" |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1477 | this is a bit of a hack; we should probably have a better way to handle this |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2159 | Windows etc. |
| FIXME | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2280 | get_rows can use additional threads, but the cost of launching additional threads |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2431 | support > 64 CPUs |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2524 | there seems to be no way to set a lower priority on Apple platforms |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2547 | this may not work on BSD, to be verified |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2893 | this can become (n_tasks-1) |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2896 | this can become (n_tasks-1) |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2899 | this can become (n_tasks-1) |
| FIXME | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.cpp:136 | deep copy |
| TODO | llama.cpp/ggml/src/ggml-cpu/ggml-cpu.cpp:665 | move to ggml-base |
| FIXME | llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp:303 | this should check for __ARM_FEATURE_FP16_VECTOR_ARITHMETIC |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1708 | support for transposed / permuted tensors |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1712 | maybe this is not optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1752 | support for transposed / permuted tensors |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1756 | maybe this is not optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1797 | templateify the implementation and add support for I64 |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1832 | support for transposed / permuted tensors |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1850 | maybe this is not optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1913 | smarter multi-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1956 | smarter multi-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:1999 | smarter multi-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:2042 | smarter multi-threading |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:3729 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:3798 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:3970 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:4070 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:4410 | add an x parameter to ggml_vec_scale_f32 and remove this memcpy |
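
The ops.cpp:4410 entry wants a source parameter added to the scale routine so the staging memcpy can go away. A minimal sketch of the out-of-place variant; the signature is a guess modeled on ggml_vec_scale_f32, not the actual API:

```cpp
// Out-of-place scale: read from x, write to y, no staging copy required.
static void vec_scale_f32_src(const int n, float * y, const float * x, const float v) {
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * v;
    }
}
```
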
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:5086 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:5165 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:5253 | is this supposed to be ceil instead of floor? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:5378 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:7713 | optimize |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:8556 | on ARM, native f16 should be faster |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9214 | transpose the output for smaller strides for big batches? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9331 | maybe unroll more? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9419 | what happens when (d_state % svcntw()) != 0? |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9495 | optimize / multi-thread |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:9562 | optimize / multi-thread |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:10403 | Write SVE code and RVV code |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:10658 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/ops.cpp:10756 | handle transposed/permuted matrices |
| TODO | llama.cpp/ggml/src/ggml-cpu/quants.c:151 | add WASM SIMD |
| TODO | llama.cpp/ggml/src/ggml-cpu/repack.cpp:2379 | this branch seems wrong |
| TODO | llama.cpp/ggml/src/ggml-cpu/repack.cpp:2500 | generalise. |
| TODO | llama.cpp/ggml/src/ggml-cpu/repack.cpp:2541 | needs to be revisited |
| TODO | llama.cpp/ggml/src/ggml-cpu/repack.cpp:2805 | General batched mul mat for 4D tensors |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:468 | is this optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:568 | is this optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:862 | Does this work? |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:886 | is this optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/simd-mappings.h:978 | is this optimal? |
| TODO | llama.cpp/ggml/src/ggml-cpu/unary-ops.cpp:135 | Use the 'traits' lookup table (for type conversion fns) instead of a mass of 'if' conditions with long templates |
| TODO | llama.cpp/ggml/src/ggml-cpu/vec.cpp:475 | optimize to process the remaining elements in groups using the smaller vector sizes from AVX2 and SSE |
| TODO | llama.cpp/ggml/src/ggml-cpu/vec.h:609 | Write SVE code |
| TODO | llama.cpp/ggml/src/ggml-cpu/vec.h:672 | Write SVE code |
| TODO | llama.cpp/ggml/src/ggml-cpu/vec.h:950 | optimize performance |
| TODO | llama.cpp/ggml/src/ggml-cuda/CMakeLists.txt:60 | Remove once CCCL 3.2 has been released and bundled with the CUDA Toolkit |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:190 | might need to bail out if the HTP is stuck on something |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:205 | handle errors |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:208 | update the profiling implementation; it currently only works for opt_opsync mode |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:1825 | support broadcast for ne[2] and ne[3] |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:1981 | add support for non-contiguous tensors |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:2000 | add support for non-contiguous tensors |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:2047 | add support for sinks |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:2166 | add support for GGML_TYPE_F16 for src0 |
| TODO | llama.cpp/ggml/src/ggml-hexagon/ggml-hexagon.cpp:2727 | the current version might do incorrect reordering in cases where quantized src0 |
| TODO | llama.cpp/ggml/src/ggml-hexagon/htp/hex-dma.h:32 | technically we don't need these and could use Q6_dmstart/wait/etc instead |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:930 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:962 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:1044 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:1085 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:1186 | might need to handle zero as a special case (see ggml-cpu code) |
| FIXME | llama.cpp/ggml/src/ggml-hexagon/htp/matmul-ops.c:1241 | might need to handle zero as a special case (see ggml-cpu code) |
| TODO | llama.cpp/ggml/src/ggml-hexagon/htp/rope-ops.c:334 | use SIMD to speed up copying the remaining elements |
| TODO | llama.cpp/ggml/src/ggml-hip/CMakeLists.txt:86 | do not use CUDA definitions for HIP |
| TODO | llama.cpp/ggml/src/ggml-impl.h:72 | move to ggml.h? (won't be able to inline) |
| TODO | llama.cpp/ggml/src/ggml-impl.h:603 | Consider allowing GGML_OP_NONE nodes in between |
| FIXME | llama.cpp/ggml/src/ggml-metal/CMakeLists.txt:103 | only add to the ggml-metal target? |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-impl.h:9 | for optimal performance, become a function of the device and work size |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:56 | this can be removed when the allocator starts filtering them earlier |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:632 | make a simpler cpy_bytes kernel |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:1629 | relax this constraint in the future |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:1816 | helper function |
| TODO | llama.cpp/ggml/src/ggml-metal/ggml-metal-ops.cpp:1836 | determine the optimal parameters based on grid utilization |
| TODO | llama.cpp/ggml/src/ggml-musa/CMakeLists.txt:73 | do not use CUDA definitions for MUSA |
| TODO | llama.cpp/ggml/src/ggml-musa/CMakeLists.txt:107 | mudnn has not provided static libraries yet |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:2918 | initialize them for the non-SMALL_PATH path, or remove them. |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3268 | add support |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3270 | implement BF16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, IQ4_NL support (https://github.com/ggml-org/llama.cpp/pull/14661) |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3374 | add circular padding support for opencl, see https://github.com/ggml-org/llama.cpp/pull/16985 |
| FIXME | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3771 | if any unexpected results are seen, double check the offset - |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3916 | use preallocated images instead of sub-buffer then image |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:5146 | find the optimal values for these |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:8515 | remove duplicate definitions of image description + format -- move to top |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9052 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9091 | add block_q4_0 variant. |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9110 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9147 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9209 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9245 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9282 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9319 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9358 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9396 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9428 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9466 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9499 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9655 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9699 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9735 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9879 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:9918 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:10258 | Unknown GPU |
| TODO | llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:481 | currently the output_size is always known; do we need support for commands with variable output size? |
| TODO | llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:789 | cache the alloc responses to avoid extra RPC calls? |
| TODO | llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:1932 | obtain the value from the server |
| TODO | llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:1970 | call the remote backend and cache the results |
| TODO | llama.cpp/ggml/src/ggml-sycl/common.hpp:84 | adapt to hardware |
| TODO | llama.cpp/ggml/src/ggml-sycl/common.hpp:87 | currently, it's not really used for XMX. |
| TODO | llama.cpp/ggml/src/ggml-sycl/convert.cpp:517 | Downsample logic is separated from the kernel; a rewrite is desirable |
| TODO | llama.cpp/ggml/src/ggml-sycl/getrows.cpp:180 | Refactor and remove duplicates |
| TODO | llama.cpp/ggml/src/ggml-sycl/getrows.cpp:211 | k-quants |
| FIXME | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:863 | do not crash if SYCL buffer alloc fails |
| FIXME | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:1118 | this is not thread safe |
| FIXME | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:1187 | this is a hack to avoid having to implement a new buffer type |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:1202 | return device.maxBufferLength |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:2561 | check that src0->buffer->buft is a split buffer type, replace the GGML_BACKEND_TYPE_GPU_SPLIT check |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:2965 | see https://github.com/ggml-org/llama.cpp/pull/13155 |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3211 | accuracy issues in MMQ |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3536 | Refactor and clean up mul mat dispatching. |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3913 | more efficient implementation |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4459 | update for the new |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4641 | support GGML_TYPE_BF16 |
| FIXME | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4642 | keep a list of supported types to avoid breaking the backend when a new type is added |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4646 | The configuration below needs more work to be supported with oneDNN |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4652 | This specific configuration can fail with oneDNN and needs more debugging |
| TODO | llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:4846 | add circular padding support for sycl, see https://github.com/ggml-org/llama.cpp/pull/16985 |
| TODO | llama.cpp/ggml/src/ggml-sycl/softmax.cpp:67 | non-contiguous inputs/outputs |
| TODO | llama.cpp/ggml/src/ggml-sycl/sycl_hw.cpp:3 | currently not used |
| TODO | llama.cpp/ggml/src/ggml-sycl/sycl_hw.hpp:13 | currently not used |
| TODO | llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:3167 | We're no longer benefitting from the async compiles (shaders are |
| TODO | llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:5258 | Use a pointer or reference to avoid a copy |
| TODO | llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6520 | staging_offset is not used |
| TODO | llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:14371 | enable async and synchronize |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:486 | error handling |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:723 | handle multiple pipeline names |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2361 | optional, needed? |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2365 | optional, implement this |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2367 | optional, think it coordinates with .init_tensor |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2453 | for now, return maxBufferSize as both free and total memory |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2868 | track the need for these toggles: https://issues.chromium.org/issues/42251215 |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2965 | Maybe WebGPU needs a "fast" mode where you can request compilers skip adding checks like these, |
| TODO | llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:3150 | support non-contiguous tensors, e.g. for MOE_EXPERT_REDUCE |
| TODO | llama.cpp/ggml/src/ggml-zdnn/ggml-zdnn.cpp:22 | implement support for quantized types |
| TODO | llama.cpp/ggml/src/ggml-zdnn/ggml-zdnn.cpp:609 | make thread-safe |
| TODO | llama.cpp/ggml/src/ggml-zdnn/mmf.cpp:70 | Remove in the future; we currently convert DLF16 -> FP32, and then the next op converts FP32 -> DLF16 again. Inefficient. |
| TODO | llama.cpp/ggml/src/ggml-zdnn/utils.cpp:71 | Consider adding a ggml check. |
| TODO | llama.cpp/ggml/src/ggml-zdnn/utils.cpp:72 | If the tensor is 4D, use ZDNN_NCHW by default. |
| TODO | llama.cpp/ggml/src/ggml-zdnn/utils.cpp:73 | If the tensor is 2D, use ZDNN_NHWC by default. |
| FIXME | llama.cpp/ggml/src/ggml.c:10 | required here for quantization functions |
| TODO | llama.cpp/ggml/src/ggml.c:1723 | this should not be needed as long as we don't rely on aligned SIMD loads |
| TODO | llama.cpp/ggml/src/ggml.c:1992 | support a less-strict constraint |
| TODO | llama.cpp/ggml/src/ggml.c:3788 | implement non-F32 return |
| TODO | llama.cpp/ggml/src/ggml.c:3812 | implement non-F32 return |
| TODO | llama.cpp/ggml/src/ggml.c:4320 | when implementing backward, fix this: |
| TODO | llama.cpp/ggml/src/ggml.c:4923 | implement antialias for modes other than bilinear |
| TODO | llama.cpp/ggml/src/ggml.c:5264 | check if vT can be multiplied by (k*qT) |
| TODO | llama.cpp/ggml/src/ggml.c:5341 | adapt to ggml_flash_attn_ext() changes |
| TODO | llama.cpp/ggml/src/ggml.c:5344 | check if vT can be multiplied by (k*qT) |
| TODO | llama.cpp/ggml/src/ggml.c:5417 | maybe support strides other than 1? |
| TODO | llama.cpp/ggml/src/ggml.c:6093 | support other variants |
| TODO | llama.cpp/ggml/src/ggml.c:6287 | should probably be sum instead of mean |
| TODO | llama.cpp/ggml/src/ggml.c:6799 | this branch isn't accessible anymore; maybe move this to ggml_build_forward_expand |
| FIXME | llama.cpp/ggml/src/ggml.c:7421 | use ggml-backend to obtain the tensor data |
| TODO | llama.cpp/gguf-py/gguf/constants.py:3666 | add GGMLFileType from ggml_ftype in ggml.h |
| TODO | llama.cpp/gguf-py/gguf/constants.py:3746 | need help with 64-bit types in Python |
| FIXME | llama.cpp/gguf-py/gguf/gguf_reader.py:73 | When/if _get_field_parts() supports multi-dimensional arrays, this must do so too |
| TODO | llama.cpp/gguf-py/gguf/gguf_reader.py:205 | add an option to generate an error on duplicate keys |
| FIXME | llama.cpp/gguf-py/gguf/gguf_reader.py:243 | Handle multi-dimensional arrays properly instead of flattening |
| TODO | llama.cpp/gguf-py/gguf/gguf_writer.py:425 | cleaner way to get the first key |
| TODO | llama.cpp/gguf-py/gguf/lazy.py:49 | make this even more comprehensive |
| TODO | llama.cpp/gguf-py/gguf/lazy.py:101 | dict and set |
| TODO | llama.cpp/gguf-py/gguf/lazy.py:122 | maybe handle tensors in kwargs too |
| TODO | llama.cpp/gguf-py/gguf/lazy.py:228 | __array_function__ |
| TODO | llama.cpp/gguf-py/gguf/metadata.py:72 | load adapter_config.json when possible; it usually contains the base model of the LoRA adapter |
| TODO | llama.cpp/gguf-py/gguf/metadata.py:325 | should word-based size labels always be removed instead? |
| TODO | llama.cpp/gguf-py/gguf/metadata.py:354 | should the basename version always be excluded? |
| TODO | llama.cpp/gguf-py/gguf/tensor_mapping.py:1210 | these do not belong to block_mappings_cfg - move them to mappings_cfg |
| TODO | llama.cpp/gguf-py/gguf/utility.py:87 | handle request errors (maybe with limited retries?) |
| TODO | llama.cpp/gguf-py/gguf/vocab.py:163 | internally store as the new format instead of converting to the old one |
| FIXME | llama.cpp/gguf-py/gguf/vocab.py:369 | Verify that added tokens here _cannot_ overlap with the main vocab. |
| TODO | llama.cpp/gguf-py/tests/test_metadata.py:110 | hf suffix which could be ignored but isn't |
| TODO | llama.cpp/gguf-py/tests/test_metadata.py:142 | DPO in the name |
| TODO | llama.cpp/gguf-py/tests/test_metadata.py:151 | should "base" be a 'finetune' or 'size_label'? |
| TODO | llama.cpp/gguf-py/tests/test_quants.py:107 | is a column-wise sum of squares appropriate? |
| TODO | llama.cpp/include/llama.h:57 | show sample usage |
| TODO | llama.cpp/include/llama.h:90 | remove; required until per-token attributes are available from the GGUF file |
| TODO | llama.cpp/include/llama.h:197 | simplify (https://github.com/ggml-org/llama.cpp/pull/9294#pullrequestreview-2286561979) |
| TODO | llama.cpp/include/llama.h:205 | consider SoA |
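
The llama.h:205 entry says "consider SoA". A minimal illustration of the array-of-structs vs struct-of-arrays layouts it refers to, with made-up field names; SoA keeps each attribute contiguous, which is friendlier to SIMD and to partial updates:

```cpp
#include <cstdint>
#include <vector>

// Array-of-structs: fields of one item are interleaved in memory.
struct item_aos {
    int32_t id;
    float   score;
};

// Struct-of-arrays: parallel arrays, index i describes one logical item.
struct items_soa {
    std::vector<int32_t> id;
    std::vector<float>   score;

    void push(int32_t i, float s) {
        id.push_back(i);
        score.push_back(s);
    }
};
```
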
| TODO | llama.cpp/include/llama.h:239 | rename this to "output" |
| TODO | llama.cpp/include/llama.h:417 | update the API to start accepting pointers to params structs (https://github.com/ggml-org/llama.cpp/discussions/9172) |
| TODO | llama.cpp/include/llama.h:532 | rename to llama_get_pooling_type |
| TODO | llama.cpp/include/llama.h:955 | rename to avoid confusion with llama_get_embeddings() |
| TODO | llama.cpp/include/llama.h:979 | deprecate in favor of llama_get_logits_ith() (ref: https://github.com/ggml-org/llama.cpp/pull/14853#issuecomment-3113143522) |
| TODO | llama.cpp/include/llama.h:994 | deprecate in favor of llama_get_embeddings_ith() (ref: https://github.com/ggml-org/llama.cpp/pull/14853#issuecomment-3113143522) |
| TODO | llama.cpp/include/llama.h:1469 | extend in the future |
| TODO | llama.cpp/scripts/check-requirements.sh:172 | the check is failing for some reason: |
| TODO | llama.cpp/src/llama-adapter.cpp:288 | add support for norm vector |
| TODO | llama.cpp/src/llama-adapter.cpp:297 | a more general solution for non-CPU extra buft should be implemented in the future |
| TODO | llama.cpp/src/llama-adapter.h:11 | pimpl |
| TODO | llama.cpp/src/llama-batch.h:32 | whole_seqs for embeddings? |
| TODO | llama.cpp/src/llama-batch.h:113 | support embeddings if needed in the future |
| TODO | llama.cpp/src/llama-batch.h:131 | this is more of a temporary solution until we have a better way to handle multiple positions per token/embd |
| TODO | llama.cpp/src/llama-context.cpp:105 | start reading the actual value of mscale and handle the case where it is not 1.0f |
| TODO | llama.cpp/src/llama-context.cpp:305 | move these checks to ggml_backend_sched |
| TODO | llama.cpp/src/llama-context.cpp:320 | should we ignore ACCEL types too? |
| TODO | llama.cpp/src/llama-context.cpp:436 | instead of the tensor names, use a map to keep track of which (FA) tensors belong to which layer |
| FIXME | llama.cpp/src/llama-context.cpp:444 | fa_device_mismatch logic is wrong for --no-kv-offload, but this is broken anyways |
| TODO | llama.cpp/src/llama-context.cpp:500 | not sure if the following graph would be the worst case for multi-stream KV caches: |
| FIXME | llama.cpp/src/llama-context.cpp:548 | if multiple single tokens are evaluated without a synchronization, |
| TODO | llama.cpp/src/llama-context.cpp:645 | change mctx->apply() to return information on whether a graph reserve is needed |
| TODO | llama.cpp/src/llama-context.cpp:722 | use output_resolve_row() |
| TODO | llama.cpp/src/llama-context.cpp:773 | use output_resolve_row() |
| TODO | llama.cpp/src/llama-context.cpp:987 | not sure yet if we want to reserve here |
| TODO | llama.cpp/src/llama-context.cpp:1112 | should we reserve? |
| TODO | llama.cpp/src/llama-context.cpp:1203 | add a new split mode where we pad the input sequences so that ubatch.equal_seqs == true |
| TODO | llama.cpp/src/llama-context.cpp:1213 | this clearing of the buffer can easily be forgotten - need something better |
| TODO | llama.cpp/src/llama-context.cpp:1235 | this is a tmp solution until we have a proper way to support enc-dec models |
| TODO | llama.cpp/src/llama-context.cpp:1317 | hacky solution |
| TODO | llama.cpp/src/llama-context.cpp:1493 | avoid this workaround in the future |
| TODO | llama.cpp/src/llama-context.cpp:1543 | this clearing of the buffer can easily be forgotten - need something better |
| TODO | llama.cpp/src/llama-context.cpp:1783 | is there something more efficient which also minimizes swaps? |
| TODO | llama.cpp/src/llama-context.cpp:1833 | hacky enc-dec support |
| TODO | llama.cpp/src/llama-context.cpp:1864 | also consider shrinking the buffer |
| TODO | llama.cpp/src/llama-context.cpp:1873 | not needed? |
| TODO | llama.cpp/src/llama-context.cpp:2039 | not sure if needed; might simplify in the future by removing this |
| FIXME | llama.cpp/src/llama-context.cpp:2144 | fix in ggml_backend_sched |
| TODO | llama.cpp/src/llama-context.cpp:2491 | add more model-specific info which should prevent loading the session file if not identical |
| TODO | llama.cpp/src/llama-context.cpp:2549 | handle sampling buffers and sampler state? |
| TODO | llama.cpp/src/llama-context.cpp:2574 | add more info which needs to be identical but which is not verified otherwise |
| TODO | llama.cpp/src/llama-context.cpp:2638 | handle sampling buffers and sampler state? |
| TODO | llama.cpp/src/llama-context.cpp:2831 | handle this error |
| TODO | llama.cpp/src/llama-context.cpp:2948 | better default |
| TODO | llama.cpp/src/llama-context.h:188 | more flexible combinations of logical/physical batch size and context size |
| TODO | llama.cpp/src/llama-context.h:251 | read/write lora adapters and cvec |
| TODO | llama.cpp/src/llama-context.h:268 | tmp for handling cross-attention - need something better, probably |
| TODO | llama.cpp/src/llama-grammar.h:71 | remove; needed for tests atm |
| TODO | llama.cpp/src/llama-grammar.h:133 | shared ptr |
| TODO | llama.cpp/src/llama-grammar.h:178 | move the API below as member functions of llama_grammar |
| TODO | llama.cpp/src/llama-graph.cpp:99 | use ubatch->n_seqs instead of failing |
| TODO | llama.cpp/src/llama-graph.cpp:404 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:453 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:456 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:474 | use ubatch->n_seqs instead of failing |
| TODO | llama.cpp/src/llama-graph.cpp:522 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:538 | Hybrid input classes are a bit redundant. |
| TODO | llama.cpp/src/llama-graph.cpp:626 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:635 | need to move this to the unified cache and check there |
| TODO | llama.cpp/src/llama-graph.cpp:1266 | Use scalar div instead when/if implemented |
| TODO | llama.cpp/src/llama-graph.cpp:1376 | move to hparams? |
| TODO | llama.cpp/src/llama-graph.cpp:1392 | add support for gated squared relu |
| TODO | llama.cpp/src/llama-graph.cpp:1612 | needs more work to be correct; for now just use the tensor shape |
| TODO | llama.cpp/src/llama-graph.cpp:1856 | if ubatch.equal_seqs() == true, we can split the three tensors below into ubatch.n_seqs_unq streams |
| TODO | llama.cpp/src/llama-graph.cpp:1927 | remove |
| TODO | llama.cpp/src/llama-graph.cpp:2183 | maybe separate the inner implementation into a separate function |
| TODO | llama.cpp/src/llama-graph.cpp:2588 | Call llama_sampler_accept_ggml after all samplers have been applied. |
| TODO | llama.cpp/src/llama-graph.h:58 | tmp - need something better to pass the data from the encoder to the decoder |
| TODO | llama.cpp/src/llama-graph.h:61 | this needs more work to be correct; for now copy the embeddings data to host memory |
| TODO | llama.cpp/src/llama-graph.h:738 | needed by build_attn_mha; figure out a way to remove? |
| TODO | llama.cpp/src/llama-graph.h:897 | remove |
| TODO | llama.cpp/src/llama-graph.h:951 | move this implementation to llama_memory_recurrent. |
| TODO | llama.cpp/src/llama-graph.h:1020 | better name |
| TODO | llama.cpp/src/llama-hparams.cpp:149 | maybe support convolution strides other than 1 |
| TODO | llama.cpp/src/llama-hparams.h:292 | think of a better place for this function |
| TODO | llama.cpp/src/llama-hparams.h:293 | pack the SWA params in a struct? |
| TODO | llama.cpp/src/llama-impl.h:64 | rename to llama_format? |
| TODO | llama.cpp/src/llama-kv-cache-iswa.cpp:206 | if we fail again, we should attempt different splitting strategies |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1086 | add a ggml helper function for this? |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1485 | support multiple streams |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1489 | use ubatch->n_seqs instead of failing |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1760 | we also need to save llama_kv_cell_ext when apply_ubatch() supports loading it |
| TODO | llama.cpp/src/llama-kv-cache.cpp:1912 | we cannot yet restore llama_kv_cell_ext as apply_ubatch() does not support it yet |
| TODO | llama.cpp/src/llama-kv-cells.h:31 | add unit tests |
| TODO | llama.cpp/src/llama-memory-hybrid-iswa.cpp:76 | non-sequential equal split can be done if using unified KV cache |
| TODO | llama.cpp/src/llama-memory-hybrid-iswa.cpp:95 | will the recurrent cache be in an undefined context at this point? |
| TODO | llama.cpp/src/llama-memory-hybrid.cpp:76 | non-sequential equal split can be done if using unified KV cache |
| TODO | llama.cpp/src/llama-memory-hybrid.cpp:95 | will the recurrent cache be in an undefined context at this point? |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:390 | non-sequential equal split can be done if using unified KV cache |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:430 | optimize |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:482 | would it be possible to resize the cache instead? |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:623 | bake-in src refcounts in the cell metadata |
| TODO | llama.cpp/src/llama-memory-recurrent.cpp:931 | llama_memory_recurrent should have a notion of max sequences |
| TODO | llama.cpp/src/llama-memory-recurrent.h:15 | extract the cache state used for graph computation into llama_memory_recurrent_context_i |
| TODO | llama.cpp/src/llama-memory-recurrent.h:78 | optimize for recurrent state needs |
| TODO | llama.cpp/src/llama-memory-recurrent.h:178 | extract all the state like `head` and `n` here |
| TODO | llama.cpp/src/llama-mmap.cpp:43 | consider moving to llama-impl.h if needed in more places |
| TODO | llama.cpp/src/llama-model-loader.cpp:496 | this is not very clever - figure out something better |
| TODO | llama.cpp/src/llama-model-loader.cpp:659 | make optional |
| TODO | llama.cpp/src/llama-model-saver.cpp:202 | implement split file support |
| TODO | llama.cpp/src/llama-model-saver.cpp:247 | implement LoRA support |
| TODO | llama.cpp/src/llama-model.cpp:593 | Handle SWA metadata similarly when models start implementing it |
| TODO | llama.cpp/src/llama-model.cpp:853 | become a GGUF KV parameter |
| TODO | llama.cpp/src/llama-model.cpp:876 | become a GGUF KV parameter |
| TODO | llama.cpp/src/llama-model.cpp:995 | become a GGUF KV parameter |
| TODO | llama.cpp/src/llama-model.cpp:1200 | fix conversion scripts to correctly populate `n_swa` and `n_swa_pattern` |
| TODO | llama.cpp/src/llama-model.cpp:1515 | Jamba layers are a bit heterogeneous, so naming this is hard. |
| TODO | llama.cpp/src/llama-model.cpp:1815 | when MTP is implemented, this should probably be updated if needed |
| TODO | llama.cpp/src/llama-model.cpp:1883 | add variants |
| TODO | llama.cpp/src/llama-model.cpp:2157 | when MTP is implemented, this should probably be updated if needed |
| TODO | llama.cpp/src/llama-model.cpp:2488 | maybe add n_attn_temp_floor_scale as a separate KV? |
| TODO | llama.cpp/src/llama-model.cpp:2893 | move to a separate function |
| FIXME | llama.cpp/src/llama-model.cpp:7464 | workaround for the CPU backend buft having a NULL device |
| TODO | llama.cpp/src/llama-model.cpp:8572 | move reranking logic here and generalize |
| TODO | llama.cpp/src/llama-model.h:546 | move this to a new llm_arch_model_i interface |
| TODO | llama.cpp/src/llama-model.h:549 | move this to a new llm_arch_model_i interface |
| TODO | llama.cpp/src/llama-model.h:562 | remove |
| TODO | llama.cpp/src/llama-quant.cpp:181 | avoid hardcoded tensor names - use the TN_* constants |
| TODO | llama.cpp/src/llama-quant.cpp:313 | explore better strategies |
| TODO | llama.cpp/src/llama-quant.cpp:320 | explore better strategies |
| TODO | llama.cpp/src/llama-quant.cpp:589 | use LLM_KV |
| TODO | llama.cpp/src/llama-quant.cpp:590 | use LLM_KV |
| TODO | llama.cpp/src/llama-quant.cpp:654 | avoid hardcoded tensor names - use the TN_* constants |
| TODO | llama.cpp/src/llama-quant.cpp:867 | use a symmetric type instead |
| TODO | llama.cpp/src/llama-quant.cpp:985 | temporary sanity check that the F16 -> MXFP4 conversion is lossless |
| TODO | llama.cpp/src/llama-sampler.cpp:2548 | remove trigger_words support. |
| TODO | llama.cpp/src/llama-vocab.cpp:246 | there are a lot of common parts between spm and bpe tokenizers; they should be refactored and reused |
| TODO | llama.cpp/src/llama-vocab.cpp:730 | reduce string copies by using the cpts_offs array |
| TODO | llama.cpp/src/llama-vocab.cpp:1578 | should we set all of these to LLAMA_TOKEN_NULL? |
| TODO | llama.cpp/src/llama-vocab.cpp:2131 | remove; required until per-token attributes are available from the GGUF file |
| TODO | llama.cpp/src/llama-vocab.cpp:2230 | convert scripts should provide these tokens through the KV metadata LLM_KV_TOKENIZER_... |
| TODO | llama.cpp/src/llama-vocab.cpp:2497 | workaround for the o200k_harmony and solar-open tokenizers: the "<\|end\|>" token should not be EOG |
| TODO | llama.cpp/src/llama-vocab.cpp:2574 | Extract attributes from the GGUF file. |
| TODO | llama.cpp/src/llama-vocab.cpp:3271 | where do these characters come from? |
| FIXME | llama.cpp/src/models/bitnet.cpp:153 | do not use model.tok_embd directly, duplicate it as model.output |
| TODO | llama.cpp/src/models/chameleon.cpp:161 | this suppresses the output of image tokens, which is required to enable text-only outputs. |
| TODO | llama.cpp/src/models/gemma3.cpp:19 | is causal == true correct? might need some changes |
| TODO | llama.cpp/src/models/gemma3n-iswa.cpp:22 | is causal == true correct? might need some changes |
| TODO | llama.cpp/src/models/gemma3n-iswa.cpp:209 | move this to right after the last KV layer |
| TODO | llama.cpp/src/models/gemma3n-iswa.cpp:261 | verify if this is the correct behavior in the transformers implementation |
| TODO | llama.cpp/src/models/graph-context-mamba.cpp:131 | skip computing output earlier for unused tokens |
| TODO | llama.cpp/src/models/graph-context-mamba.cpp:244 | use semistructured matrices to implement state-space duality |
| TODO | llama.cpp/src/models/graph-context-mamba.cpp:260 | skip computing output earlier for unused tokens |
| TODO | llama.cpp/src/models/grovemoe.cpp:100 | Only do the expert selection and weights once |
| TODO | llama.cpp/src/models/kimi-linear.cpp:428 | can this ever be false? |
| TODO | llama.cpp/src/models/minicpm3.cpp:4 | if the model varies, these parameters need to be read from the model |
| TODO | llama.cpp/src/models/minicpm3.cpp:145 | is this correct? |
| TODO | llama.cpp/src/models/models.h:6 | remove in a follow-up PR - move to .cpp files |
| TODO | llama.cpp/src/unicode.h:7 | reimplement this structure in an endian-independent way |
| TODO | llama.cpp/tests/CMakeLists.txt:156 | disabled on loongarch64 because the ggml-ci node lacks Python 3.8 |
| TODO | llama.cpp/tests/CMakeLists.txt:171 | disabled due to slowness |
| TODO | llama.cpp/tests/CMakeLists.txt:232 | repair known memory leaks |
| TODO | llama.cpp/tests/test-backend-ops.cpp:2289 | Make a template or something |
| TODO | llama.cpp/tests/test-backend-ops.cpp:3132 | implement |
| TODO | llama.cpp/tests/test-backend-ops.cpp:4621 | add a test with a non-contiguous view as input; this case is needed for build_rope_2d in clip.cpp |
| TODO | llama.cpp/tests/test-backend-ops.cpp:6145 | this branch should become a separate test-case parameter instead of hardcoding this for these head shapes |
| TODO | llama.cpp/tests/test-backend-ops.cpp:6965 | implement for all backends |
| TODO | llama.cpp/tests/test-backend-ops.cpp:6977 | or "other" |
| TODO | llama.cpp/tests/test-backend-ops.cpp:6988 | implement for all backends |
| TODO | llama.cpp/tests/test-backend-ops.cpp:7486 | add after WebGPU is fixed |
| TODO | llama.cpp/tests/test-backend-ops.cpp:8908 | better value for n_threads |
| TODO | llama.cpp/tests/test-backend-sampler.cpp:734 | biasing too much here makes the Vulkan sampling fail - should be investigated further |
| TODO | llama.cpp/tests/test-chat-template.cpp:625 | llama_chat_format_single will be deprecated, remove these tests later |
| TODO | llama.cpp/tests/test-chat.cpp:121 | extract to a common helper (copied from test-grammar-integration.cpp) |
| TODO | llama.cpp/tests/test-grammar-integration.cpp:1414 | The following line should fail, but currently it passes. `exclusiveMinimum` is not supported, as it would likely be too difficult to implement. |
| TODO | llama.cpp/tests/test-grammar-integration.cpp:1421 | The following line should fail, but currently it passes. `uniqueItems` is not supported, as it would likely be too difficult to implement. |
| TODO | llama.cpp/tests/test-grammar-llguidance.cpp:1083 | The following line should fail, but currently it passes. `uniqueItems` is not supported, as it would likely be too difficult to implement. |
| TODO | llama.cpp/tests/test-grammar-parser.cpp:7 | should not include libllama sources |
| TODO | llama.cpp/tests/test-json-partial.cpp:153 | detect that the true/false/null literal was complete |
| FIXME | llama.cpp/tests/test-quantize-fns.cpp:63 | why is this done twice? |
| TODO | llama.cpp/tests/test-regex-partial.cpp:265 | ((?:b)?a*+).* ?? |
| TODO | llama.cpp/tools/cli/cli.cpp:68 | show progress |
| TODO | llama.cpp/tools/cli/cli.cpp:75 | reduce some copies here in the future |
| TODO | llama.cpp/tools/cli/cli.cpp:152 | support remote files in the future (http, https, etc.) |
| TODO | llama.cpp/tools/cli/cli.cpp:198 | maybe support it later? |
| TODO | llama.cpp/tools/cli/cli.cpp:212 | avoid using atexit() here by making `console` a singleton |
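
The cli.cpp:212 entry wants the atexit() hook replaced by making `console` a singleton. A minimal sketch using a Meyers singleton, whose destructor runs automatically at normal program exit; `console_sketch` is illustrative, not the actual common/console API:

```cpp
class console_sketch {
  public:
    // Constructed on first use; destroyed automatically at program exit.
    static console_sketch & instance() {
        static console_sketch inst;
        return inst;
    }

    console_sketch(const console_sketch &)             = delete;
    console_sketch & operator=(const console_sketch &) = delete;

  private:
    console_sketch()  { /* save and initialize terminal state */ }
    ~console_sketch() { /* restore terminal state - replaces the atexit() hook */ }
};
```
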
| TODO | llama.cpp/tools/completion/completion.cpp:916 | one inconvenience of the current chat template implementation is that we can't distinguish between user input and special tokens (prefix/postfix) |
| TODO | llama.cpp/tools/cvector-generator/cvector-generator.cpp:211 | get rid of malloc if possible |
| TODO | llama.cpp/tools/cvector-generator/cvector-generator.cpp:241 | get rid of this malloc if possible |
| TODO | llama.cpp/tools/cvector-generator/cvector-generator.cpp:287 | customize the padding token |
| TODO | llama.cpp/tools/cvector-generator/pca.hpp:72 | enable Metal support when support for GGML_OP_SQRT is added |
| TODO | llama.cpp/tools/cvector-generator/pca.hpp:139 | buf_size must be able to scale with params.n_batch |
| TODO | llama.cpp/tools/export-lora/export-lora.cpp:193 | remove this when we can support merging a subset of adapters. Ref: https://github.com/ggml-org/llama.cpp/pull/8607#discussion_r1686027777 |
| TODO | llama.cpp/tools/export-lora/export-lora.cpp:303 | add support for quantized lora |
| TODO | llama.cpp/tools/gguf-split/gguf-split.cpp:350 | detect the OS and use copy_file_range() here for better performance |
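
The gguf-split.cpp:350 entry suggests a copy_file_range() fast path. A minimal sketch of the Linux-only branch; on other platforms the existing buffered copy would remain, and error handling is reduced to the essentials:

```cpp
#if defined(__linux__)
#include <unistd.h>

// Kernel-side copy between file descriptors: no userspace bounce buffer.
static bool copy_range_fast(int fd_in, int fd_out, size_t len) {
    while (len > 0) {
        ssize_t n = copy_file_range(fd_in, nullptr, fd_out, nullptr, len, 0);
        if (n <= 0) {
            return false; // caller falls back to the portable read/write loop
        }
        len -= (size_t) n;
    }
    return true;
}
#endif
```
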
| TODO |
llama.cpp/tools/imatrix/imatrix.cpp:678 |
extract into its own method; this is also used by the GGUF-based format |
| TODO |
llama.cpp/tools/imatrix/imatrix.cpp:814 |
extract into its own method; this is also used by the legacy format |
| TODO |
llama.cpp/tools/imatrix/imatrix.cpp:1006 |
only get outputs when (params.process_output || params.compute_ppl) |
| TODO |
llama.cpp/tools/mtmd/clip-graph.h:100 |
there was a more efficient which relies on ggml_view and ggml_rope_ext_inplace, but the rope inplace does not work well with non-contiguous tensors ; we should fix that and revert back to the original implementation in https://github.com/ggml-org/llama.cpp/pull/13065 |
| TODO |
llama.cpp/tools/mtmd/clip-impl.h:204 |
improve this later |
| TODO |
llama.cpp/tools/mtmd/clip-model.h:99 |
support warmup size for custom token numbers |
| TODO |
llama.cpp/tools/mtmd/clip-model.h:239 |
rename it to fc (fully connected layer) |
| TODO |
llama.cpp/tools/mtmd/clip.cpp:345 |
q/k norm requires row size == n_embd, while here it's d_head |
| TODO | llama.cpp/tools/mtmd/clip.cpp:646 | there was a more efficient implementation which relies on ggml_view and ggml_rope_ext_inplace, but the rope inplace does not work well with non-contiguous tensors; we should fix that and revert back to the original implementation in https://github.com/ggml-org/llama.cpp/pull/13065 |
| TODO | llama.cpp/tools/mtmd/clip.cpp:1131 | verify the image_min_tokens |
| TODO | llama.cpp/tools/mtmd/clip.cpp:1142 | check kimivl preprocessor for exact values |
| TODO | llama.cpp/tools/mtmd/clip.cpp:1464 | this is a hack to support Yi-type llava |
| TODO | llama.cpp/tools/mtmd/clip.cpp:2143 | we don't support audio for Gemma 3N, but the GGUF contains audio tensors |
| TODO | llama.cpp/tools/mtmd/clip.cpp:2269 | define the behavior for add_padding = false |
| TODO | llama.cpp/tools/mtmd/clip.cpp:2631 | this is only used by minicpmv, maybe remove it |
| TODO | llama.cpp/tools/mtmd/clip.cpp:3994 | remove this function |
| TODO | llama.cpp/tools/mtmd/clip.cpp:4002 | remove this function |
| TODO | llama.cpp/tools/mtmd/clip.h:61 | should be an enum, not a string |
| TODO | llama.cpp/tools/mtmd/mtmd-audio.cpp:381 | handle short audio differently or return an error |
| TODO | llama.cpp/tools/mtmd/mtmd-audio.cpp:400 | probably unnecessary here? (or better done in g_cache?) |
| TODO | llama.cpp/tools/mtmd/mtmd-audio.cpp:412 | handle these checks better |
| TODO | llama.cpp/tools/mtmd/mtmd-audio.cpp:520 | maybe handle this better |
| TODO | llama.cpp/tools/mtmd/mtmd-cli.cpp:84 | support --system-prompt with the /clear command |
| TODO | llama.cpp/tools/mtmd/mtmd.cpp:702 | maybe support batching, but this may come with a memory cost |
| TODO | llama.cpp/tools/mtmd/mtmd.h:187 | deprecate |
| TODO | llama.cpp/tools/mtmd/mtmd.h:190 | deprecate |
| TODO | llama.cpp/tools/mtmd/mtmd.h:192 | deprecate |
| TODO | llama.cpp/tools/mtmd/mtmd.h:217 | deprecate |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:869 | this could be made smaller; it's currently the worst-case size |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:905 | don't evaluate the last token of each sequence |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:1145 | the last token of each sequence doesn't need to be evaluated |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:1167 | this could be made smaller; it's currently the worst-case size |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:1199 | end before the last token; no need to predict past the end of the sequences |
| FIXME | llama.cpp/tools/perplexity/perplexity.cpp:1244 | this uses the wrong first logits when not skipping the choice word |
| TODO | llama.cpp/tools/perplexity/perplexity.cpp:1575 | don't evaluate the last token of each sequence |
| TODO | llama.cpp/tools/quantize/quantize.cpp:75 | share with imatrix.cpp |
| TODO | llama.cpp/tools/quantize/quantize.cpp:587 | list multiple datasets when there is more than one |
| TODO | llama.cpp/tools/server/server-common.cpp:157 | use base64::decode from base64.hpp |
| TODO | llama.cpp/tools/server/server-common.cpp:939 | add audio_url support by reusing handle_media() |
| TODO | llama.cpp/tools/server/server-common.cpp:1003 | test this properly |
| TODO | llama.cpp/tools/server/server-common.cpp:1049 | The response format of this option is not yet OAI-compatible, but it seems like no one is really using it; we may need to fix it in the future |
| TODO | llama.cpp/tools/server/server-common.cpp:1705 | reuse llama_detokenize |
| TODO | llama.cpp/tools/server/server-common.cpp:1847 | optimize this block by reducing memory allocations and movement |
| TODO | llama.cpp/tools/server/server-common.cpp:1868 | make the project name an input |
| TODO | llama.cpp/tools/server/server-common.cpp:1897 | current filename |
| TODO | llama.cpp/tools/server/server-common.cpp:1904 | configurable? |
| TODO | llama.cpp/tools/server/server-common.h:152 | server_tokens should be copyable - remove this |
| TODO | llama.cpp/tools/server/server-common.h:303 | move it to server-task.cpp |
| TODO | llama.cpp/tools/server/server-common.h:310 | move it to server-task.cpp |
| TODO | llama.cpp/tools/server/server-common.h:346 | move these to server-task.cpp |
| TODO | llama.cpp/tools/server/server-context.cpp:51 | change to unique_ptrs for consistency |
| TODO | llama.cpp/tools/server/server-context.cpp:59 | move members that belong to the task (such as `generated_text`, `has_new_line`) to task_results_state |
| TODO | llama.cpp/tools/server/server-context.cpp:997 | mtmd does not support prompt cache |
| TODO | llama.cpp/tools/server/server-context.cpp:1021 | improve logic |
| TODO | llama.cpp/tools/server/server-context.cpp:1087 | This will error out if a user requests two aloras, but only |
| TODO | llama.cpp/tools/server/server-context.cpp:1154 | speculative decoding requires multiple samples per batch - not supported yet |
| TODO | llama.cpp/tools/server/server-context.cpp:1157 | getting post/pre sampling logits is not yet supported with backend sampling |
| TODO | llama.cpp/tools/server/server-context.cpp:1160 | tmp until backend sampling is fully implemented |
| TODO | llama.cpp/tools/server/server-context.cpp:1256 | improve by not doing it more than once for each new line |
| TODO | llama.cpp/tools/server/server-context.cpp:1339 | optimize this with min-p optimization |
| TODO | llama.cpp/tools/server/server-context.cpp:1957 | simplify and improve |
| TODO | llama.cpp/tools/server/server-context.cpp:2042 | rework to have a single draft llama_context shared across all slots [TAG_SERVER_SPEC_REWORK] |
| TODO | llama.cpp/tools/server/server-context.cpp:2127 | maybe move the branch outside of this loop in the future |
| TODO | llama.cpp/tools/server/server-context.cpp:2164 | support memory-less logits computation |
| TODO | llama.cpp/tools/server/server-context.cpp:2337 | support can be added in the future when corresponding vision models get released |
| TODO | llama.cpp/tools/server/server-context.cpp:2476 | try to make this conditional on the context or the memory module, instead of the model type |
| TODO | llama.cpp/tools/server/server-context.cpp:2627 | try to terminate only the largest active slot/sequence and continue with the rest |
| TODO | llama.cpp/tools/server/server-context.cpp:2637 | update slot state based on llama_memory_seq_pos_min() and llama_memory_seq_pos_max() |
| TODO | llama.cpp/tools/server/server-context.cpp:2641 | handle ret == 2 (abort) when we start aborting |
| TODO | llama.cpp/tools/server/server-context.cpp:2768 | set it here instead of doing it inside populate_token_probs |
| TODO | llama.cpp/tools/server/server-context.cpp:2826 | set result.probs |
| TODO | llama.cpp/tools/server/server-context.cpp:2963 | this log can become very long; put it behind a flag or think about a more compact format |
| TODO | llama.cpp/tools/server/server-context.cpp:2977 | this is inaccurate due to child tasks |
| TODO | llama.cpp/tools/server/server-context.cpp:3223 | get rid of this dynamic_cast |
| TODO | llama.cpp/tools/server/server-context.cpp:3328 | get rid of this dynamic_cast |
| TODO | llama.cpp/tools/server/server-context.cpp:3531 | this could maybe be multimodal |
| TODO | llama.cpp/tools/server/server-http.cpp:357 | maybe handle unsuccessful sink.write? For now, we rely on is_connection_closed() |
| TODO | llama.cpp/tools/server/server-http.h:23 | move this to a virtual function once we have proper polymorphism support |
| TODO | llama.cpp/tools/server/server-models.cpp:7 | remove this once we use the HTTP client from download.h |
| TODO | llama.cpp/tools/server/server-models.cpp:153 | maybe validate the preset before rendering? |
| TODO | llama.cpp/tools/server/server-models.cpp:196 | allow refreshing the cached model list |
| TODO | llama.cpp/tools/server/server-models.cpp:800 | add support for this in the web UI |
| TODO | llama.cpp/tools/server/server-models.cpp:886 | add other fields; may require reading GGUF metadata |
| TODO | llama.cpp/tools/server/server-models.h:24 | also add a downloading state when the logic is added |
| TODO | llama.cpp/tools/server/server-task.cpp:65 | deduplicate? |
| TODO | llama.cpp/tools/server/server-task.cpp:123 | deduplicate? |
| TODO | llama.cpp/tools/server/server-task.cpp:213 | implement |
| TODO | llama.cpp/tools/server/server-task.cpp:279 | add more sanity checks for the input parameters |
| TODO | llama.cpp/tools/server/server-task.cpp:413 | we may want to throw errors here, in case "el" is incorrect |
| TODO | llama.cpp/tools/server/server-task.cpp:1902 | for some reason we can't copy server_tokens, so we have to do this workaround |
| TODO | llama.cpp/tools/server/server-task.h:11 | prevent including the whole server-common.h, as we only use server_tokens |
| TODO | llama.cpp/tools/server/server-task.h:31 | change this to a more generic "response_format" to replace the "format_response_*" in server-common |
| TODO | llama.cpp/tools/server/server-task.h:63 | implement |
| TODO | llama.cpp/tools/server/server-task.h:500 | somehow reuse server_metrics in the future, instead of duplicating the fields |
| TODO | llama.cpp/tools/server/server.cpp:268 | refactor into common/console |
| TODO | llama.cpp/tools/server/tests/unit/test_chat_completion.py:254 | should not be a valid case |
| TODO | llama.cpp/tools/server/tests/unit/test_completion.py:163 | remove this once test_cache_vs_nocache_prompt is fixed |
| TODO | llama.cpp/tools/server/tests/unit/test_completion.py:181 | remove this once test_cache_vs_nocache_prompt is fixed |
| TODO | llama.cpp/tools/server/tests/unit/test_completion.py:201 | remove this once test_cache_vs_nocache_prompt is fixed |
| FIXME | llama.cpp/tools/server/tests/unit/test_completion.py:369 | the result is not deterministic when using a slot other than slot 0 |
| TODO | llama.cpp/tools/server/tests/unit/test_lora.py:59 | remove this once test_cache_vs_nocache_prompt is fixed |
| TODO | llama.cpp/tools/server/tests/unit/test_lora.py:82 | find & add other lora adapters for this model |
| TODO | llama.cpp/tools/server/tests/unit/test_lora.py:108 | remove this once test_cache_vs_nocache_prompt is fixed |
| TODO | llama.cpp/tools/server/tests/unit/test_tool_call.py:422 | fix these (wrong results: either didn't respect the decimal instruction or got the wrong value) |
| TODO | llama.cpp/tools/server/webui/src/lib/stores/models.svelte.ts:458 | Remove this polling once llama-server properly waits for the operation |
| TODO | llama.cpp/tools/tokenize/tokenize.cpp:2 | start using log.h |
| TODO | llama.cpp/tools/tokenize/tokenize.cpp:10 | remove me |
| TODO | llama.cpp/tools/tokenize/tokenize.cpp:78 | potential opportunity to roll common stuff into common/console.cpp |
| TODO | llama.cpp/tools/tokenize/tokenize.cpp:180 | reporting invalid_utf8 would be useful on non-Windows too |
| TODO | llama.cpp/tools/tts/convert_pt_to_hf.py:4 | this script is LLM-generated, probably very inefficient, and should be rewritten |
| TODO | llama.cpp/tools/tts/tts-outetts.py:148 | load from JSON |
| TODO | llama.cpp/tools/tts/tts-outetts.py:181 | tokenization is slow for some reason - here is pre-tokenized input |
| TODO | llama.cpp/tools/tts/tts.cpp:200 | not optimized at all |
| TODO | llama.cpp/tools/tts/tts.cpp:273 | can be done once |
| TODO | llama.cpp/tools/tts/tts.cpp:1022 | all logits? |
| TODO | nonstd.h:76 | `%s\n", __FILE__, __LINE__, message); \` |
| TODO | termbox2.h:2416 | Assert global.back.(width,height) == global.front.(width,height) |
| TODO | termbox2.h:2540 | iswprint ch? |
| TODO | termbox2.h:2662 | \r, \t, \v, \f, etc? |
| TODO | termbox2.h:2948 | Reorder TB_CAP_* so more critical caps come first. |
| TODO | termbox2.h:3497 | Harden against errors encountered mid-resize |
| TODO | termbox2.h:4048 | iswprint ch? |