Skip to content

Cstannahill/LocalInference

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LocalInference API

A high-performance, modular General Inference API compatible with OpenAI's API specification. Built for local LLM inference with advanced context management, sliding window token optimization, and technical RAG capabilities.

Features

  • OpenAI-Compatible API: Drop-in replacement for OpenAI's chat completions API
  • Multi-Provider Support: Ollama, OpenRouter, and extensible provider architecture
  • Advanced Context Management: Sliding window token management with intelligent compression
  • Technical RAG: Optimized retrieval for technical documentation and code
  • Session-Based Architecture: Persistent conversation state with checkpointing
  • Streaming Support: Real-time token streaming for all endpoints
  • PostgreSQL Backend: Production-ready persistence with JSONB vector storage

Quick Start

Prerequisites

Installation

# Clone the repository
git clone https://github.com/Cstannahill/LocalInference
cd LocalInference

# Restore dependencies
dotnet restore

# Configure database (edit appsettings.json with PostgreSQL connection)
# Default: Host=localhost;Database=LocalInference;Username=postgres;Password=postgres

# Apply migrations
cd src/LocalInference.Api
dotnet ef database update

# Start the API
dotnet run

Verify Installation

# Health check
curl http://localhost:5000/health

# Create inference config
curl -X POST http://localhost:5000/api/inference-configs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Ollama Default",
    "modelIdentifier": "Qwen3.5-2B-UC",
    "providerType": "Ollama",
    "temperature": 0.7,
    "topP": 0.9,
    "contextWindow": 8192,
    "maxTokens": 2048,
    "isDefault": true
  }'

# Test chat completions (requires Ollama with Qwen3.5-2B-UC model)
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-2B-UC",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

First Run Checklist

  • ✅ .NET 9.0 SDK installed: dotnet --version
  • ✅ PostgreSQL running: psql --version
  • ✅ Ollama running: curl http://localhost:11434/api/tags
  • ✅ Model available: ollama pull Qwen3.5-2B-UC
  • ✅ Migrations applied: cd src/LocalInference.Api && dotnet ef migrations list
  • ✅ API running: dotnet run (should show listening on port 5000)

API Overview

Base URL

http://localhost:5000

Authentication

Currently, the API runs without authentication in local mode. For production deployment, add authentication middleware as needed.

Core Concepts

Sessions

Sessions replace the traditional character/persona model. A session:

  • Maintains conversation history
  • Has configurable context window size
  • Supports multiple inference configurations
  • Automatically manages token budgets

Inference Configurations

Reusable configuration profiles defining:

  • Model identifier and provider
  • Generation parameters (temperature, top_p, etc.)
  • System prompts
  • Token limits

Context Management

Intelligent token management strategies:

  • Sliding Window: Keeps most recent messages within budget
  • Summarization: Compresses older messages into checkpoints
  • Smart Compression: Automatically applies optimal strategy

Technical RAG

Retrieval-augmented generation optimized for:

  • Technical documentation
  • Code references
  • API documentation
  • Structured knowledge bases

API Endpoints

Chat Completions (OpenAI Compatible)

Create Chat Completion

POST /v1/chat/completions

Request Body:

{
  "model": "Qwen3.5-2B-UC",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "temperature": 0.7,
  "max_tokens": 2048,
  "stream": false
}

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama3.2",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 9,
    "total_tokens": 34
  }
}

Streaming

Set stream: true for Server-Sent Events (SSE) streaming:

curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

Session Management

Create Session

POST /api/sessions
{
  "name": "My Chat Session",
  "description": "Technical discussion about APIs",
  "inferenceConfigId": "550e8400-e29b-41d4-a716-446655440000",
  "contextWindowTokens": 8192,
  "maxOutputTokens": 2048
}

List Sessions

GET /api/sessions?activeOnly=true&skip=0&take=100

Get Session

GET /api/sessions/{id}

Update Session

PUT /api/sessions/{id}
{
  "name": "Updated Session Name",
  "contextWindowTokens": 16384
}

Delete Session

DELETE /api/sessions/{id}

Get Session Statistics

GET /api/sessions/{id}/statistics

Response:

{
  "totalMessages": 42,
  "totalTokens": 15360,
  "averageMessageLength": 245,
  "checkpointCount": 3,
  "compressionRatio": 0.32,
  "firstMessageAt": "2024-01-15T10:30:00Z",
  "lastMessageAt": "2024-01-15T14:45:00Z"
}

Inference Configuration

Create Configuration

POST /api/configs
{
  "name": "Llama 3.2 Default",
  "modelIdentifier": "llama3.2",
  "providerType": "Ollama",
  "temperature": 0.7,
  "topP": 0.9,
  "maxTokens": 2048,
  "systemPrompt": "You are a helpful coding assistant.",
  "isDefault": true
}

List Configurations

GET /api/configs

Update Configuration

PUT /api/configs/{id}
{
  "temperature": 0.5,
  "maxTokens": 4096
}

Retrieval (RAG)

Query Documents

POST /api/retrieval/query
{
  "query": "How do I configure dependency injection?",
  "maxResults": 5,
  "maxTokens": 2000,
  "minScore": 0.7,
  "documentTypes": ["TechnicalDocumentation", "CodeReference"],
  "language": "csharp"
}

Response:

{
  "query": "How do I configure dependency injection?",
  "results": [
    {
      "content": "To configure DI in ASP.NET Core, use services.AddSingleton<T>()...",
      "source": "ASP.NET Core Documentation",
      "score": 0.92,
      "tokenCount": 156,
      "documentType": "TechnicalDocumentation",
      "language": "csharp"
    }
  ]
}

Add Document

POST /api/retrieval/documents
{
  "title": "API Documentation",
  "content": "Full document content here...",
  "documentType": "TechnicalDocumentation",
  "language": "csharp",
  "framework": "ASP.NET Core"
}

Index Document

POST /api/retrieval/documents/{id}/index

Reindex All Documents

POST /api/retrieval/reindex

Configuration

appsettings.json

{
  "ConnectionStrings": {
    "DefaultConnection": "Host=localhost;Database=LocalInference;Username=postgres;Password=postgres"
  },
  "Inference": {
    "Ollama": {
      "BaseUrl": "http://localhost:11434"
    },
    "OpenRouter": {
      "ApiKey": "your-api-key-here"
    }
  }
}

Environment Variables

export ConnectionStrings__DefaultConnection="Host=localhost;Database=LocalInference;..."
export Inference__Ollama__BaseUrl="http://localhost:11434"
export Inference__OpenRouter__ApiKey="your-api-key"

Architecture

Domain Layer

  • Entities: Session, InferenceConfig, ContextMessage, ContextCheckpoint, TechnicalDocument
  • Value Objects: TokenBudget, ContextWindowState, RetrievalResult
  • Enums: MessageRole, InferenceProviderType, DocumentType

Application Layer

  • Services: InferenceService, SessionManagementService, ContextManager
  • Abstractions: IInferenceProvider, IEmbeddingProvider, ITechnicalRetrievalService

Infrastructure Layer

  • Persistence: PostgreSQL with EF Core
  • Inference: Ollama and OpenRouter providers
  • Retrieval: Vector similarity search with cosine distance
  • Summarization: Technical context compression

API Layer

  • Minimal API endpoints
  • OpenAI-compatible request/response models
  • Streaming support via SSE

Advanced Usage

Custom Context Management

// Compress context using specific strategy
await contextManager.CompressContextAsync(
    sessionId,
    CompressionStrategy.SmartCompression);

// Get current context state
var state = await contextManager.GetContextStateAsync(sessionId);
Console.WriteLine($"Utilization: {state.UtilizationRatio:P}");

Direct Inference

POST /v1/inference
{
  "configId": "550e8400-e29b-41d4-a716-446655440000",
  "messages": [{ "role": "user", "content": "Explain quantum computing" }],
  "temperature": 0.3,
  "maxTokens": 1024
}

Session with Retrieval Context

{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "What does this API do?" }],
  "session_id": "existing-session-id",
  "retrieval_context": [
    {
      "content": "The API provides LLM inference...",
      "source": "Documentation",
      "relevance_score": 0.95
    }
  ]
}

Performance Optimization

Token Budget Management

  • Default context window: 8192 tokens
  • Reserve for output: 2048 tokens
  • Reserve for system: 512 tokens
  • Available for context: ~5600 tokens

Database Indexes

  • Sessions: IsActive, LastActivityAt, CreatedAt
  • ContextMessages: SessionId + SequenceNumber, IsSummarized
  • TechnicalDocuments: DocumentType, IsIndexed, Language
  • DocumentChunks: TechnicalDocumentId, ChunkIndex

Caching Strategies

  • Inference configurations cached in memory
  • Session state optimized for frequent reads
  • Document embeddings stored with JSONB

Troubleshooting

Common Issues

Database Connection Failed

Check PostgreSQL is running and connection string is correct.
Ensure database exists: CREATE DATABASE LocalInference;

Ollama Connection Failed

Verify Ollama is running: curl http://localhost:11434/api/tags
Check base URL in configuration

Token Budget Exceeded

Reduce context window size or enable compression
Check session statistics for utilization ratio

Logs

# Enable debug logging
export Logging__LogLevel__Default=Debug

# View logs
dotnet run --verbosity diagnostic

Development

Project Structure

src/
├── LocalInference.Domain/          # Domain entities
├── LocalInference.Application/     # Business logic
├── LocalInference.Infrastructure/  # Data access, providers
└── LocalInference.Api/             # HTTP API

Adding a New Provider

  1. Implement IInferenceProvider interface
  2. Register in InferenceProviderFactory
  3. Add configuration options
  4. Create HTTP client registration

Running Tests

dotnet test

License

MIT License - See LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

Support

  • Issues: GitHub Issues
  • Discussions: GitHub Discussions
  • Documentation: This README and API docs

About

A high-performance, modular General Inference API compatible with OpenAI's API specification. Built for local LLM inference with advanced context management, sliding window token optimization, and technical RAG capabilities.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages