DeepDoc API Service - powered by Docling

A high-performance document processing API service built with FastAPI that uses Docling for intelligent document parsing, chunking, and embedding generation.

Features

Multi-format Support: Process PDF, DOCX, PPTX, HTML, TXT, JPG, JPEG, and PNG files
Intelligent Chunking: Uses Docling's HybridChunker for semantic text chunking
Embedding Generation: Integrated support for external embedding services
URL Processing: Download and process files from URLs and Google Drive links
Background Processing: Asynchronous job processing with status tracking
Callback Support: Webhook notifications for job completion
Redis Integration: Persistent job status and result storage

API Endpoints

POST `/upload_and_chunk`

Process documents from file uploads or URLs.

Parameters:

files: List of uploaded files (optional)
urls: List of URLs to process (optional)
from_page: Starting page number (default: 0)
to_page: Ending page number (default: 100000)
lang: Language for processing (default: "english")
with_embeddings: Generate embeddings for chunks (default: false)
callback_url: Webhook URL for job completion notifications

Response:

{"job_id": "uuid-string"}

GET `/job/{job_id}`

Check the status of a processing job.

Response:

{
  "status": "pending|processing|completed|failed|cancelled",
  "result": [/* array of chunks */],
  "error": "error message if failed"
}

POST `/job/{job_id}/cancel`

Cancel a running job.

GET `/health`

Comprehensive health check endpoint.

GET `/liveness`

Lightweight health check for Kubernetes liveness probes.

Installation

Docker (Recommended)

# Build the image
docker build -t docling-api .

# Run the container
docker run -p 8991:8991 \
  -e EMBEDDING_SERVICE_URL="your_embedding_url" \
  -e EMBEDDING_SERVICE_API_KEY="your_api_key" \
  -v ./uploads:/app/uploads \
  docling-api

Docker Compose

# Start with docker-compose
docker-compose up -d

Local Development

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export EMBEDDING_SERVICE_URL="your_embedding_url"
export EMBEDDING_SERVICE_API_KEY="your_api_key"
export REDIS_URL="redis://localhost:6379/0"

# Run the server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8991

Environment Variables

Variable	Description	Default
`API_HOST`	Host to bind the server to	`0.0.0.0`
`API_PORT`	Port to run the server on	`8991`
`EMBEDDING_SERVICE_URL`	URL of the embedding service	-
`EMBEDDING_SERVICE_API_KEY`	API key for embedding service	-
`EMBEDDING_MODEL`	Model to use for embeddings	`intfloat/multilingual-e5-large-instruct`
`REDIS_URL`	Redis connection URL	`redis://localhost:6379/0`
`UPLOAD_DIR`	Directory for uploaded files	`./uploads`
`PROCESSED_DIR`	Directory for processed files	`./processed`
`MAX_FILE_SIZE`	Maximum file size in bytes	`52428800` (50MB)

Usage Examples

Process Uploaded Files

curl -X POST "http://localhost:8991/upload_and_chunk" \
  -F "files=@document1.pdf" \
  -F "files=@document2.docx" \
  -F "from_page=0" \
  -F "to_page=100000" \
  -F "lang=english" \
  -F "with_embeddings=true"

Process URLs

curl -X POST "http://localhost:8991/upload_and_chunk" \
  -F "urls=https://example.com/document.pdf" \
  -F "urls=https://drive.google.com/file/d/GOOGLE_DRIVE_ID/view" \
  -F "with_embeddings=true"

Check Job Status

curl "http://localhost:8991/job/12345678-1234-1234-1234-123456789012"

Cancel a Job

curl -X POST "http://localhost:8991/job/12345678-1234-1234-1234-123456789012/cancel"

Response Format

Processed chunks are returned in the following format:

{
  "status": "completed",
  "result": [
    {
      "id": "chunk_001",
      "metadata": {
        "page_num_int": [1, 2],
        "original_filename": "document.pdf",
        "chunk_size": 1024,
        "chunk_overlap": 100
      },
      "text": "Chunk content text...",
      "embeddings": [0.1, 0.2, 0.3, ...]  // if with_embeddings=true
    }
  ],
  "error": null
}

Supported File Formats

Documents: PDF, DOCX, PPTX
Text: HTML, TXT
Images: JPG, JPEG, PNG (with OCR)
URLs: HTTP/HTTPS links, Google Drive links

Performance Considerations

Large Documents: The service handles large documents by processing in batches
Memory Management: Uses multiprocessing to isolate memory-intensive operations
Timeout Handling: Jobs timeout after 3 minutes by default (configurable)
Concurrency: Uses FastAPI's async capabilities and background tasks

Error Handling

The service provides comprehensive error handling:

Validation Errors: Proper HTTP status codes and error messages
Processing Failures: Graceful fallback to basic text extraction
Network Issues: Retry logic for external service calls
File Errors: Support for corrupted or malformed files

Monitoring

Health Endpoints: /health and /liveness for monitoring
Logging: Comprehensive logging for debugging and monitoring
Metrics: Built-in performance tracking (can be extended with Prometheus)

Development

Project Structure

app/
├── main.py              # FastAPI application and endpoints
├── models.py           # Pydantic models and data structures
├── processing.py       # Document processing logic
├── embeddings.py       # Embedding service integration
├── config.py          # Configuration settings
└── jivas_embeddings.py # Jivas embeddings client

Adding New Features

New File Formats: Add support in processing.py
New Embedding Services: Implement in embeddings.py
API Endpoints: Add to main.py with proper validation
Processing Logic: Extend DocumentProcessor class

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For support and questions:

Create an issue on GitHub
Check the API documentation at /docs when the service is running
Review the health endpoints for service status

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

Note: This service requires external embedding services to be configured for embedding generation. Without them, text chunking will still work, but embeddings will not be generated.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
app		app
.dockerignore		.dockerignore
.env.example		.env.example
.flake8		.flake8
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yaml		docker-compose.yaml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

DeepDoc API Service - powered by Docling

Features

API Endpoints

POST /upload_and_chunk

GET /job/{job_id}

POST /job/{job_id}/cancel

GET /health

GET /liveness

Installation

Docker (Recommended)

Docker Compose

Local Development

Environment Variables

Usage Examples

Process Uploaded Files

Process URLs

Check Job Status

Cancel a Job

Response Format

Supported File Formats

Performance Considerations

Error Handling

Monitoring

Development

Project Structure

Adding New Features

License

Support

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 9

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

POST `/upload_and_chunk`

GET `/job/{job_id}`

POST `/job/{job_id}/cancel`

GET `/health`

GET `/liveness`

Packages