path: root/llama.cpp/examples/parallel/README.md
author    Mitja Felicijan <mitja.felicijan@gmail.com>  2026-02-12 20:57:17 +0100
committer Mitja Felicijan <mitja.felicijan@gmail.com>  2026-02-12 20:57:17 +0100
commit    b333b06772c89d96aacb5490d6a219fba7c09cc6 (patch)
tree      211df60083a5946baa2ed61d33d8121b7e251b06 /llama.cpp/examples/parallel/README.md
download  llmnpc-b333b06772c89d96aacb5490d6a219fba7c09cc6.tar.gz
Engage!
Diffstat (limited to 'llama.cpp/examples/parallel/README.md')
-rw-r--r--  llama.cpp/examples/parallel/README.md  14
1 file changed, 14 insertions, 0 deletions
diff --git a/llama.cpp/examples/parallel/README.md b/llama.cpp/examples/parallel/README.md
new file mode 100644
index 0000000..2468a30
--- /dev/null
+++ b/llama.cpp/examples/parallel/README.md
@@ -0,0 +1,14 @@
+# llama.cpp/example/parallel
+
+Simplified simulation of serving incoming requests in parallel
+
+## Example
+
+Generate 128 client requests (`-ns 128`), simulating 8 concurrent clients (`-np 8`). The system prompt is shared (`-pps`), meaning that it is computed once at the start. Each client request consists of up to 10 junk questions (`--junk 10`) followed by the actual question.
+
+```bash
+llama-parallel -m model.gguf -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384
+```
+
+> [!NOTE]
+> It's recommended to use base models with this example. Instruction-tuned models might not be able to properly follow the custom chat template specified here, so the results might not be as expected.
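A quick back-of-envelope reading of the flags in the committed README (this note is illustrative, not part of the diff): with `-ns 128` total requests served across `-np 8` parallel slots, and assuming every slot stays busy, the simulation works through the requests in roughly 128/8 = 16 full waves:

```shell
# Rough capacity check for the example flags (assumes all -np slots stay busy).
ns=128   # total client requests (-ns 128)
np=8     # simulated concurrent clients (-np 8)
echo $(( ns / np ))   # → 16 full waves of requests
```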