cpp-llamalib

A C++17 single-file header-only wrapper for llama.cpp.
Just include cpp-llamalib.h to call llama.cpp with a simple, high-level API.

This library is used in "Building a Desktop LLM App with cpp-httplib".

Features

Header-only — single file, no build step for the wrapper itself
Two-tier API: raw generate() and template-aware chat()
ChatSession wrapper for multi-turn conversations
KV cache reuse — shared prompt prefixes are not re-processed
Stop sequences — halt generation when a specified string appears
Tokenize API — count tokens before generation to check context limits
Structured output via GBNF grammar constraints
Thread-safe concurrent generation via slot pool
Custom sampler configuration

Integration

Please copy cpp-llamalib.h into your project and #include it.

Your project must link against llama.cpp (llama and ggml libraries).
Tested against llama.cpp b8775, 2026-04-13.

Examples

Raw text generation

generate() sends the prompt directly to the model without any chat template formatting.

#include "cpp-llamalib.h"

llamalib::Llama llm("model.gguf");
auto result = llm.generate("Once upon a time");

Chat

chat() applies the model's embedded chat template before generation.

// Single message (auto-wrapped as "user" role)
auto result = llm.chat("Explain C++ RAII in one sentence.");

// Multi-turn with explicit history management
std::vector<llamalib::Message> messages = {
    {"system", "You are a helpful assistant."},
    {"user", "What is RAII?"},
};
auto r1 = llm.chat(messages);

messages.push_back({"assistant", r1});
messages.push_back({"user", "Give me an example."});
auto r2 = llm.chat(messages);

ChatSession

ChatSession manages conversation history automatically. Create one via Llama::session().

auto session = llm.session("You are a C++ expert.");
auto r1 = session.say("What is a smart pointer?");
auto r2 = session.say("How does unique_ptr differ from shared_ptr?");

session.clear();  // Reset history (system prompt is preserved)

Streaming

Pass a callback to receive tokens as they are generated. Return false from the callback to stop early. Works with both generate() and chat().

llm.chat("Write a haiku about the sea.", [](std::string_view token) {
    std::cout << token << std::flush;
    return true;  // return false to stop early
});

Structured output (GBNF grammar)

Use GenerateOptions with a GBNF grammar string to constrain model output. Works with generate(), chat(), and ChatSession::say().

llamalib::GenerateOptions opts;
opts.max_tokens = 32;
opts.grammar = R"(root ::= "yes" | "no")";

auto answer = llm.chat("Is the sky blue?", opts);
// answer is guaranteed to be "yes" or "no"

Stop sequences

Use GenerateOptions::stop to halt generation when any of the specified strings appears. The stop sequence itself is excluded from the output. Works with generate(), chat(), and streaming.

llamalib::GenerateOptions opts;
opts.max_tokens = 128;
opts.stop = {"\n", "。"};

auto result = llm.chat("List one item:", opts);
// Generation stops at the first newline or period

Tokenize

Use tokenize() or token_count() to inspect token counts before generation — useful for checking context limits.

auto tokens = llm.tokenize("Hello, world!");  // std::vector<llama_token>
auto count = llm.token_count("Hello, world!"); // count == tokens.size()

// Check context limits before generating
if (llm.token_count(prompt) > 1800) {
    // Truncate or split the prompt
}

KV cache reuse

KV cache is automatically reused across calls on the same slot. When consecutive prompts share a common prefix (e.g. multi-turn chat), only the new tokens are decoded — previous KV entries are kept. Use clear_cache() to explicitly reset all slots.

llm.generate("The quick brown fox jumps", {16});  // full decode
llm.generate("The quick brown fox runs", {16});   // only re-decodes "runs"

llm.clear_cache();  // force full re-processing on next call

Custom parameters

Configure context size, temperature, and concurrency via Options.

llamalib::Options opts;
opts.n_ctx = 4096;
opts.temperature = 0.7f;
opts.n_slots = 4;  // concurrent generation slots

llamalib::Llama llm("model.gguf", opts);

Error handling

All errors are reported as std::runtime_error — invalid model paths, tokenization failures, and context overflow.

try {
    llamalib::Llama llm("/bad/path.gguf");
} catch (const std::runtime_error &e) {
    // "Failed to load model"
}

try {
    // Throws if prompt + max_tokens exceeds context size
    auto result = llm.generate(very_long_prompt);
} catch (const std::runtime_error &e) {
    // "Prompt too long: 1024 prompt tokens + 512 max_tokens exceeds context size 1024"
}

Concurrent generation

Set n_slots to enable multiple concurrent generations. Calls to generate() are thread-safe — if all slots are busy, callers block until a slot becomes available.

llamalib::Options opts;
opts.n_slots = 2;
llamalib::Llama llm("model.gguf", opts);

// 4 requests on 2 slots: 2 run immediately, 2 wait for a free slot
std::vector<std::future<std::string>> futures;
for (int i = 0; i < 4; i++) {
    futures.push_back(std::async(std::launch::async, [&llm] {
        return llm.generate("Hello");
    }));
}

Custom sampler

Override the default sampler chain by providing a sampler_config callback. You have full access to llama.cpp's sampler API.

llamalib::Options opts;
opts.sampler_config = [](llama_sampler *chain) {
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.0f));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(42));
};

llamalib::Llama llm("model.gguf", opts);

API Reference

`llamalib::Options`

Field	Type	Default	Description
`n_gpu_layers`	`int`	`99`	Number of layers to offload to GPU
`n_ctx`	`int`	`2048`	Context window size
`temperature`	`float`	`0.3f`	Sampling temperature
`n_slots`	`int`	`1`	Number of concurrent generation slots
`sampler_config`	`SamplerConfig`	`nullptr`	Custom sampler chain configuration callback

`llamalib::GenerateOptions`

Field	Type	Default	Description
`max_tokens`	`int`	`-1`	Maximum tokens to generate (-1 = until EOS or context limit)
`grammar`	`std::string`	`""`	GBNF grammar string (empty = no constraint)
`grammar_root`	`std::string`	`""`	Root rule name (defaults to `"root"`)
`stop`	`std::vector<std::string>`	`{}`	Stop sequences (generation stops when any appears; empty strings are ignored)

`llamalib::Llama`

Constructor

Llama(const std::string &model_path, const Options &params = {})

Loads the model and creates context/sampler slots. Throws std::runtime_error on failure.

`generate`

std::string generate(const std::string &prompt, const GenerateOptions &opts = {})
void generate(const std::string &prompt, const StreamCallback &callback)
void generate(const std::string &prompt, const GenerateOptions &opts, const StreamCallback &callback)

Raw text completion — sends the prompt directly without chat template formatting. Thread-safe. Throws std::runtime_error if the prompt is too long for the context window or if decoding fails.

`chat`

// Single message (auto-wrapped as "user" role)
std::string chat(const std::string &message, const GenerateOptions &opts = {})
void chat(const std::string &message, const StreamCallback &callback)
void chat(const std::string &message, const GenerateOptions &opts, const StreamCallback &callback)

// Multi-turn
std::string chat(const std::vector<Message> &messages, const GenerateOptions &opts = {})
void chat(const std::vector<Message> &messages, const StreamCallback &callback)
void chat(const std::vector<Message> &messages, const GenerateOptions &opts, const StreamCallback &callback)

Applies the model's embedded chat template, then generates. Falls back to raw concatenation if the model has no template. Throws std::runtime_error if the prompt is too long for the context window or if decoding fails.

`tokenize`

std::vector<llama_token> tokenize(const std::string &text) const

Tokenizes the given text and returns the token IDs.

`token_count`

size_t token_count(const std::string &text) const

Returns the number of tokens in the given text. Equivalent to tokenize(text).size().

`clear_cache`

void clear_cache()

Clears the KV cache and cached token history for all slots. Use when you want to force full prompt re-processing on the next call.

`llamalib::Message`

struct Message {
    std::string role;     // "system", "user", or "assistant"
    std::string content;
};

`llamalib::ChatSession`

Created via Llama::session():

auto session = llm.session("You are a helpful assistant.");

Wraps Llama::chat() with automatic conversation history management.

Method	Description
`say(message, opts = {})`	Appends user message, calls `chat()`, appends assistant reply, returns reply
`say(message, callback)`	Streaming version (void); history is still updated internally
`say(message, opts, callback)`	Streaming with options
`history()`	Returns `const std::vector<Message>&` of the conversation
`clear()`	Clears history (preserves system prompt if set)

Type aliases

using StreamCallback = std::function<bool(std::string_view token)>;
using SamplerConfig = std::function<void(llama_sampler *chain)>;

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.github/workflows		.github/workflows
scripts		scripts
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
cpp-llamalib.h		cpp-llamalib.h
justfile		justfile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cpp-llamalib

Features

Integration

Examples

Raw text generation

Chat

ChatSession

Streaming

Structured output (GBNF grammar)

Stop sequences

Tokenize

KV cache reuse

Custom parameters

Error handling

Concurrent generation

Custom sampler

API Reference

`llamalib::Options`

`llamalib::GenerateOptions`

`llamalib::Llama`

Constructor

`generate`

`chat`

`tokenize`

`token_count`

`clear_cache`

`llamalib::Message`

`llamalib::ChatSession`

Type aliases

License

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

cpp-llamalib

Features

Integration

Examples

Raw text generation

Chat

ChatSession

Streaming

Structured output (GBNF grammar)

Stop sequences

Tokenize

KV cache reuse

Custom parameters

Error handling

Concurrent generation

Custom sampler

API Reference

llamalib::Options

llamalib::GenerateOptions

llamalib::Llama

Constructor

generate

chat

tokenize

token_count

clear_cache

llamalib::Message

llamalib::ChatSession

Type aliases

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages

`llamalib::Options`

`llamalib::GenerateOptions`

`llamalib::Llama`

`generate`

`chat`

`tokenize`

`token_count`

`clear_cache`

`llamalib::Message`

`llamalib::ChatSession`