PDF to searchable text. Two tools in one package:
- docindex-ocr: CLI that turns PDFs into text. Uses embedded text layers when present, falls back to Tesseract OCR otherwise. Supports per-document language auto-detection. Writes plain text or NDJSON.
- docindex-gui: desktop GUI (PySide6) that ingests .txt and .ndjson outputs into a Tantivy full-text index and provides ranked search with HTML snippets.
Requires Python 3.10+, Tesseract, and uv.
git clone git@github.com:polymood/DocumentIndexer.git
cd DocumentIndexer
uv sync

System packages (Debian/Ubuntu/Fedora):
# Debian/Ubuntu
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu
# Fedora
sudo dnf install tesseract tesseract-langpack-eng tesseract-langpack-deu

Add more tesseract-ocr-<lang> (Debian/Ubuntu) or tesseract-langpack-<lang> (Fedora) packages as needed.
uv run docindex-ocr INPUT [INPUT ...] --output OUT_DIR [OPTIONS]

- INPUT: PDF file or directory (recursed for *.pdf).
- --output: destination directory. Mirrors the input layout under it.
- --format {txt,ndjson}: output format. Default txt. ndjson writes one record per document with a metadata block plus the full text.
- --ocr-lang LANG: Tesseract language string (e.g. eng, eng+deu). Default eng+deu.
- --auto-lang: per-document language detection. Order of resolution:
  - parent-folder map (edit LANG_MAP in lang.py),
  - lingua detector on any direct-extracted text,
  - sample-OCR of page 0 with eng, then lingua.
- --ocr-dpi N: render DPI for OCR. Default 200. Use 300 for small print.
- --ocr-workers N: number of pages OCR'd in parallel. Default = CPU count.
- --max-pages N: pages rendered per OCR batch. Default 24.
- --force-ocr: OCR even when a text layer exists.
- --workers N: number of PDFs processed in parallel. Default 1 (recommended: one PDF at a time, all cores on its pages).
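The --auto-lang resolution order can be sketched roughly as follows. This is an illustration only, not the code in lang.py: the function name resolve_ocr_lang, the contents of LANG_MAP, the injected detect and sample_ocr callables, and the fallback to eng+deu when nothing is detected are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional

# Hypothetical folder-to-language map; the real LANG_MAP lives in lang.py.
LANG_MAP = {"german": "deu", "english": "eng"}


def resolve_ocr_lang(
    pdf_path: Path,
    extracted_text: str,
    detect: Callable[[str], Optional[str]],
    sample_ocr: Callable[[Path], str],
    lang_map: Mapping[str, str] = LANG_MAP,
    fallback: str = "eng+deu",  # assumption: reuse the --ocr-lang default
) -> str:
    """Pick a Tesseract language string using the three-step order."""
    # 1. The parent-folder map wins when the folder name is listed.
    folder = pdf_path.parent.name.lower()
    if folder in lang_map:
        return lang_map[folder]
    # 2. Otherwise run the detector on text from the embedded text layer.
    if extracted_text.strip():
        lang = detect(extracted_text)
        if lang:
            return lang
    # 3. Otherwise OCR a sample of page 0 and run the detector on that.
    lang = detect(sample_ocr(pdf_path))
    return lang or fallback


def toy_detect(text: str) -> Optional[str]:
    """Toy stand-in for a lingua-based detector."""
    words = text.lower().split()
    if "und" in words:
        return "deu"
    if "the" in words:
        return "eng"
    return None
```

Passing the detector and the sample-OCR step in as callables keeps the ordering logic testable without Tesseract or lingua installed.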
To add custom metadata, edit src/document_indexer/ocr/metadata.py. The build_metadata(pdf_path) function returns a dict that is merged into every NDJSON record. Add fields as needed (publication, year, source URL, etc.).
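As a starting point, a customized build_metadata might look like the sketch below. The body is hypothetical: the shipped function is not reproduced here, and the field names (filename, publication, year) and the rules that derive them from the path are assumptions chosen for illustration.

```python
import re
from pathlib import Path
from typing import Any


def build_metadata(pdf_path: Path) -> dict[str, Any]:
    """Hypothetical body: derive extra fields from the path alone."""
    meta: dict[str, Any] = {
        "filename": pdf_path.name,
        # Assumption: the parent folder names the publication.
        "publication": pdf_path.parent.name,
    }
    # Assumption: a four-digit year (1900-2099) may appear in the file name.
    match = re.search(r"(?<!\d)(19|20)\d{2}(?!\d)", pdf_path.stem)
    if match:
        meta["year"] = int(match.group(0))
    return meta
```

With this sketch, a file at archive/Gazette/gazette_1987-03.pdf would get publication "Gazette" and year 1987, while a file with no year in its name simply has no year key.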
uv run docindex-gui [--preset PRESET.json]

The window is a schema-driven Tantivy indexer. It is not a search tool; it only writes the index directory. Pair it with a separate search front-end of your choice (e.g. Quickwit, a CLI script, or your own UI).
Workflow:
- Paths: pick a source folder and an output directory for the Tantivy index. Choose TXT (one file = one document) or NDJSON (one line = one document).
- Schema: edit the field table. Each row defines one Tantivy field (text/integer/date) plus the rule that pulls its value (from file contents, filename, regex on the filename, a JSON key, etc.). The TXT defaults and NDJSON defaults buttons seed sensible defaults.
- Indexer params: Tantivy writer heap size (preset buttons + custom MB) and writer thread count.
- Run: progress bar, log, and a cancel button (rolls back uncommitted writes).
Drop a schema.ndjson file at the root of an NDJSON source folder to auto-load the schema. Each non-empty, non-comment line is one field definition.
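The "non-empty, non-comment line" rule can be sketched as below. This is not the GUI's loader: the "#" comment prefix, the assumption that each definition is a JSON object, and the name/type keys in the sample are all hypothetical, since the exact field-definition format is not spelled out here.

```python
import json
from typing import Any, Iterable


def parse_schema_lines(
    lines: Iterable[str],
    comment_prefix: str = "#",  # assumption: the comment marker is not documented here
) -> list[dict[str, Any]]:
    """Return one dict per non-empty, non-comment line."""
    fields: list[dict[str, Any]] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(comment_prefix):
            continue
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        fields.append(record)
    return fields


# Hypothetical field definitions; the real key names may differ.
sample = [
    "# schema for an NDJSON source folder",
    "",
    '{"name": "body", "type": "text"}',
    '{"name": "year", "type": "integer"}',
]
fields = parse_schema_lines(sample)
```

Reporting the line number on a malformed entry makes a broken schema file easier to fix by hand.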
Presets (paths + params + schema) can be saved/loaded from the toolbar or passed via --preset PATH.json on launch.
uv sync --extra dev
uv run pre-commit install
uv run pytest
uv run ruff check .
uv run ruff format .
uv run mypy

MIT. See LICENSE.