DocumentIndexer

PDF to searchable text. Two tools in one package:

docindex-ocr — CLI that turns PDFs into text. Uses embedded text layers when present, falls back to Tesseract OCR otherwise. Supports per-document language auto-detection. Writes plain text or NDJSON.
docindex-gui — desktop GUI (PySide6) that ingests .txt and .ndjson outputs into a Tantivy full-text index. It only builds the index; searching is left to a separate front-end.
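
The "text layer first, OCR as fallback" behaviour of docindex-ocr can be pictured with a minimal sketch. Everything in it is hypothetical: the Page class, page_text, and fake_ocr are stand-ins invented for illustration, and the sketch decides page by page, which the README does not confirm for the real tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    """Hypothetical stand-in for one PDF page."""

    number: int
    text_layer: str  # embedded text; empty or whitespace for scanned pages


def page_text(page: Page, ocr: Callable[[Page], str], force_ocr: bool = False) -> str:
    """Return the embedded text when present, otherwise the OCR result."""
    embedded = page.text_layer.strip()
    if embedded and not force_ocr:
        return embedded
    # No usable text layer (or OCR was forced): fall back to OCR.
    return ocr(page)


def fake_ocr(page: Page) -> str:
    # Placeholder for a real Tesseract call.
    return f"<ocr text of page {page.number}>"


pages = [Page(0, "Born-digital page"), Page(1, "   ")]
texts = [page_text(p, fake_ocr) for p in pages]
```

Passing force_ocr=True mirrors the intent of the --force-ocr flag: the text layer is ignored and every page goes through OCR.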

Install

Requires Python 3.10+, Tesseract, and uv.

git clone git@github.com:polymood/DocumentIndexer.git
cd DocumentIndexer
uv sync

System packages (Debian/Ubuntu/Fedora):

# Debian/Ubuntu
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu

# Fedora
sudo dnf install tesseract tesseract-langpack-eng tesseract-langpack-deu

Add more language packs as needed: tesseract-ocr-<lang> on Debian/Ubuntu, tesseract-langpack-<lang> on Fedora.

OCR pipeline

uv run docindex-ocr INPUT [INPUT ...] --output OUT_DIR [OPTIONS]

INPUT — PDF file or directory (recursed for *.pdf).
--output — destination directory. Mirrors the input layout under it.
--format {txt,ndjson} — output format. Default txt. ndjson writes one record per document with a metadata block plus the full text.
--ocr-lang LANG — Tesseract language string (e.g. eng, eng+deu). Default eng+deu.
--auto-lang — per-document language detection. Order of resolution:
1. parent-folder map (edit LANG_MAP in lang.py),
2. lingua detector on any direct-extracted text,
3. OCR of a sample from page 0 with eng, then lingua on the OCR'd text.
--ocr-dpi N — render DPI for OCR. Default 200. Use 300 for small print.
--ocr-workers N — number of pages OCR'd in parallel. Default = CPU count.
--max-pages N — pages rendered per OCR batch. Default 24.
--force-ocr — OCR even when a text layer exists.
--workers N — number of PDFs processed in parallel. Default 1 (recommended: one PDF at a time, all cores on its pages).
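
The --auto-lang resolution order above can be sketched as a chain of fallbacks. This is an illustration under stated assumptions, not the code in lang.py: the LANG_MAP contents, the resolve_language signature, the toy detector standing in for lingua, and the final fallback to eng+deu are all invented here.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical folder-to-language map; the real one is LANG_MAP in lang.py.
LANG_MAP = {"german_reports": "deu", "english_reports": "eng"}


def resolve_language(
    pdf_path: Path,
    extracted_text: str,
    detect: Callable[[str], Optional[str]],
    sample_ocr: Callable[[Path], str],
    default: str = "eng+deu",  # assumed fallback when nothing is detected
) -> str:
    # 1. Parent-folder map.
    mapped = LANG_MAP.get(pdf_path.parent.name)
    if mapped:
        return mapped
    # 2. Detector on any directly extracted text.
    if extracted_text.strip():
        lang = detect(extracted_text)
        if lang:
            return lang
    # 3. OCR a sample of page 0, then run the detector on that.
    lang = detect(sample_ocr(pdf_path))
    return lang or default


def toy_detect(text: str) -> Optional[str]:
    # Toy stand-in for a real detector such as lingua.
    words = set(text.lower().split())
    if words & {"der", "die", "und"}:
        return "deu"
    if words & {"the", "and", "of"}:
        return "eng"
    return None


no_ocr = lambda path: ""  # sample OCR that finds nothing
by_folder = resolve_language(Path("german_reports/a.pdf"), "", toy_detect, no_ocr)
by_text = resolve_language(Path("misc/b.pdf"), "the cat and the dog", toy_detect, no_ocr)
```

Injecting the detector and the sample-OCR step as callables keeps the ordering logic testable without Tesseract installed.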

Customising NDJSON metadata

Edit src/document_indexer/ocr/metadata.py. The build_metadata(pdf_path) function returns a dict that is merged into every NDJSON record. Add fields as needed (publication, year, source URL, etc.).
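
As a rough illustration, a customised build_metadata might derive its fields from the path alone. The field names (source_file, publication, year) and the filename convention are assumptions made for this sketch; only the function name and its pdf_path argument come from the description above.

```python
import re
from pathlib import Path


def build_metadata(pdf_path) -> dict:
    """Illustrative build_metadata: derive fields from the path alone."""
    path = Path(pdf_path)  # accept either str or Path
    meta = {
        "source_file": path.name,
        # Assumes PDFs are grouped in one folder per publication.
        "publication": path.parent.name,
    }
    # Pick up a standalone four-digit year (19xx or 20xx) in the filename.
    match = re.search(r"(?<!\d)(19|20)\d{2}(?!\d)", path.stem)
    if match:
        meta["year"] = int(match.group(0))
    return meta


record_meta = build_metadata("archive/gazette/gazette_1998_03.pdf")
```

Leaving out a key when the value cannot be determined, rather than writing a placeholder, keeps downstream integer or date fields free of junk values.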

Indexer GUI

uv run docindex-gui [--preset PRESET.json]

The window is a schema-driven Tantivy indexer. It is not a search tool — it only writes the index directory. Pair it with a separate search front-end of your choice (e.g. Quickwit, a CLI script, or your own UI).

Workflow:

Paths — pick a source folder and an output directory for the Tantivy index. Choose TXT (one file = one document) or NDJSON (one line = one document).
Schema — edit the field table. Each row defines one Tantivy field (text/integer/date) plus the rule that pulls its value (from file contents, filename, regex on the filename, a JSON key, etc.). The TXT defaults and NDJSON defaults buttons seed sensible defaults.
Indexer params — Tantivy writer heap size (preset buttons + custom MB) and writer thread count.
Run — progress bar, log, and a cancel button (rolls back uncommitted writes).

Drop a schema.ndjson file at the root of an NDJSON source folder to auto-load the schema. Each non-empty, non-comment line is one field definition.
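
The "one field definition per non-empty, non-comment line" rule could be read along these lines. This is a sketch, not the GUI's loader: the # comment marker and the keys in the example lines (name, type, source, key) are guesses made for illustration, so check a schema saved by the GUI for the real format.

```python
import json


def load_schema_lines(text: str, comment_prefix: str = "#") -> list[dict]:
    """Parse schema text: one JSON object per non-empty, non-comment line."""
    fields = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith(comment_prefix):
            continue  # skip blank lines and comments
        try:
            fields.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report the offending line so a bad schema is easy to fix.
            raise ValueError(f"schema.ndjson line {lineno}: {exc.msg}") from exc
    return fields


# Hypothetical field definitions; the real key names may differ.
example = """\
# comment line, ignored
{"name": "title", "type": "text", "source": "json_key", "key": "title"}

{"name": "year", "type": "integer", "source": "json_key", "key": "year"}
"""
fields = load_schema_lines(example)
```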

Presets (paths + params + schema) can be saved/loaded from the toolbar or passed via --preset PATH.json on launch.

Development

uv sync --extra dev
uv run pre-commit install
uv run pytest
uv run ruff check .
uv run ruff format .
uv run mypy

License

MIT. See LICENSE.
