A high-performance, modular General Inference API compatible with OpenAI's API specification. Built for local LLM inference with advanced context management, sliding window token optimization, and technical RAG capabilities.
- OpenAI-Compatible API: Drop-in replacement for OpenAI's chat completions API
- Multi-Provider Support: Ollama, OpenRouter, and extensible provider architecture
- Advanced Context Management: Sliding window token management with intelligent compression
- Technical RAG: Optimized retrieval for technical documentation and code
- Session-Based Architecture: Persistent conversation state with checkpointing
- Streaming Support: Real-time token streaming for all endpoints
- PostgreSQL Backend: Production-ready persistence with JSONB vector storage
- .NET 9.0 SDK
- PostgreSQL 14+
- Ollama (for local inference): https://ollama.ai
# Clone the repository
git clone https://github.com/Cstannahill/LocalInference
cd LocalInference
# Restore dependencies
dotnet restore
# Configure database (edit appsettings.json with PostgreSQL connection)
# Default: Host=localhost;Database=LocalInference;Username=postgres;Password=postgres
# Apply migrations
cd src/LocalInference.Api
dotnet ef database update
# Start the API
dotnet run

# Health check
curl http://localhost:5000/health
# Create inference config
curl -X POST http://localhost:5000/api/inference-configs \
-H "Content-Type: application/json" \
-d '{
"name": "Ollama Default",
"modelIdentifier": "Qwen3.5-2B-UC",
"providerType": "Ollama",
"temperature": 0.7,
"topP": 0.9,
"contextWindow": 8192,
"maxTokens": 2048,
"isDefault": true
}'
# Test chat completions (requires Ollama with Qwen3.5-2B-UC model)
curl -X POST http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-2B-UC",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'

- ✅ .NET 9.0 SDK installed: dotnet --version
- ✅ PostgreSQL running: psql --version
- ✅ Ollama running: curl http://localhost:11434/api/tags
- ✅ Model available: ollama pull Qwen3.5-2B-UC
- ✅ Migrations applied: cd src/LocalInference.Api && dotnet ef migrations list
- ✅ API running: dotnet run (should show listening on port 5000)
http://localhost:5000
Currently, the API runs without authentication in local mode. For production deployment, add authentication middleware as needed.
Sessions replace the traditional character/persona model. A session:
- Maintains conversation history
- Has configurable context window size
- Supports multiple inference configurations
- Automatically manages token budgets
Reusable configuration profiles defining:
- Model identifier and provider
- Generation parameters (temperature, top_p, etc.)
- System prompts
- Token limits
Intelligent token management strategies:
- Sliding Window: Keeps most recent messages within budget
- Summarization: Compresses older messages into checkpoints
- Smart Compression: Automatically applies optimal strategy
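The sliding-window strategy above can be sketched as follows. This is an illustrative Python sketch, not the service's actual C# implementation; per-message token counts are assumed to be precomputed:

```python
def sliding_window(messages, budget):
    """Keep the most recent messages whose combined token count fits the budget.

    messages: list of (role, content, token_count) tuples, oldest first.
    Returns the retained suffix, oldest first.
    """
    kept = []
    total = 0
    for msg in reversed(messages):  # walk newest -> oldest
        role, content, tokens = msg
        if total + tokens > budget:
            break  # everything older is dropped (or handed to summarization)
        kept.append(msg)
        total += tokens
    return list(reversed(kept))  # restore chronological order

history = [
    ("user", "First question", 120),
    ("assistant", "First answer", 300),
    ("user", "Follow-up", 80),
    ("assistant", "Second answer", 250),
]
window = sliding_window(history, budget=400)
# Only the two most recent messages (80 + 250 = 330 tokens) fit a 400-token budget.
```

The summarization strategy would hand the dropped prefix to the model for compression into a checkpoint instead of discarding it.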
Retrieval-augmented generation optimized for:
- Technical documentation
- Code references
- API documentation
- Structured knowledge bases
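Before documents like these can be retrieved, they are typically split into chunks for embedding. A minimal sketch of overlapping fixed-size chunking; the chunk size, overlap, and character-based splitting are assumptions for illustration, not the service's actual chunker:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries (e.g. a code snippet spanning two chunks).
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 500
chunks = chunk_text(doc)
# 500 chars with a 150-char step -> chunks starting at 0, 150, 300
```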
POST /v1/chat/completions

Request Body:
{
"model": "Qwen3.5-2B-UC",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello!" }
],
"temperature": 0.7,
"max_tokens": 2048,
"stream": false
}

Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1700000000,
  "model": "Qwen3.5-2B-UC",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 9,
"total_tokens": 34
}
}

Set stream: true for Server-Sent Events (SSE) streaming:
curl -X POST http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'

POST /api/sessions

{
"name": "My Chat Session",
"description": "Technical discussion about APIs",
"inferenceConfigId": "550e8400-e29b-41d4-a716-446655440000",
"contextWindowTokens": 8192,
"maxOutputTokens": 2048
}

GET /api/sessions?activeOnly=true&skip=0&take=100

GET /api/sessions/{id}

PUT /api/sessions/{id}

{
"name": "Updated Session Name",
"contextWindowTokens": 16384
}

DELETE /api/sessions/{id}

GET /api/sessions/{id}/statistics

Response:
{
"totalMessages": 42,
"totalTokens": 15360,
"averageMessageLength": 245,
"checkpointCount": 3,
"compressionRatio": 0.32,
"firstMessageAt": "2024-01-15T10:30:00Z",
"lastMessageAt": "2024-01-15T14:45:00Z"
}

POST /api/configs

{
"name": "Llama 3.2 Default",
"modelIdentifier": "llama3.2",
"providerType": "Ollama",
"temperature": 0.7,
"topP": 0.9,
"maxTokens": 2048,
"systemPrompt": "You are a helpful coding assistant.",
"isDefault": true
}

GET /api/configs

PUT /api/configs/{id}

{
"temperature": 0.5,
"maxTokens": 4096
}

POST /api/retrieval/query

{
"query": "How do I configure dependency injection?",
"maxResults": 5,
"maxTokens": 2000,
"minScore": 0.7,
"documentTypes": ["TechnicalDocumentation", "CodeReference"],
"language": "csharp"
}

Response:
{
"query": "How do I configure dependency injection?",
"results": [
{
"content": "To configure DI in ASP.NET Core, use services.AddSingleton<T>()...",
"source": "ASP.NET Core Documentation",
"score": 0.92,
"tokenCount": 156,
"documentType": "TechnicalDocumentation",
"language": "csharp"
}
]
}

POST /api/retrieval/documents

{
"title": "API Documentation",
"content": "Full document content here...",
"documentType": "TechnicalDocumentation",
"language": "csharp",
"framework": "ASP.NET Core"
}

POST /api/retrieval/documents/{id}/index

POST /api/retrieval/reindex

{
"ConnectionStrings": {
"DefaultConnection": "Host=localhost;Database=LocalInference;Username=postgres;Password=postgres"
},
"Inference": {
"Ollama": {
"BaseUrl": "http://localhost:11434"
},
"OpenRouter": {
"ApiKey": "your-api-key-here"
}
}
}

export ConnectionStrings__DefaultConnection="Host=localhost;Database=LocalInference;..."
export Inference__Ollama__BaseUrl="http://localhost:11434"
export Inference__OpenRouter__ApiKey="your-api-key"

- Entities: Session, InferenceConfig, ContextMessage, ContextCheckpoint, TechnicalDocument
- Value Objects: TokenBudget, ContextWindowState, RetrievalResult
- Enums: MessageRole, InferenceProviderType, DocumentType
- Services: InferenceService, SessionManagementService, ContextManager
- Abstractions: IInferenceProvider, IEmbeddingProvider, ITechnicalRetrievalService
- Persistence: PostgreSQL with EF Core
- Inference: Ollama and OpenRouter providers
- Retrieval: Vector similarity search with cosine distance
- Summarization: Technical context compression
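The cosine-distance retrieval mentioned above reduces to comparing a query embedding against each stored chunk embedding and ranking by similarity. A minimal sketch; the 3-dimensional vectors are fabricated for illustration (real embedding models emit hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
chunk_embeddings = {
    "di-setup": [0.9, 0.1, 0.8],
    "routing":  [0.0, 1.0, 0.1],
}
ranked = sorted(chunk_embeddings,
                key=lambda k: cosine_similarity(query, chunk_embeddings[k]),
                reverse=True)
# "di-setup" ranks first: its vector points in nearly the same direction as the query.
```

The `minScore` parameter of the retrieval endpoint would then filter out results whose similarity falls below the threshold.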
- Minimal API endpoints
- OpenAI-compatible request/response models
- Streaming support via SSE
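On the client side, consuming the SSE stream means reading `data:` lines until the `[DONE]` sentinel. A minimal parser, assuming OpenAI-style chunks with a `choices[0].delta.content` field (the exact chunk schema emitted by this API is an assumption):

```python
import json

def parse_sse(raw: str) -> str:
    """Concatenate content deltas from an OpenAI-style SSE response body."""
    deltas = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # ignore blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        deltas.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(deltas)

stream = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n'
    'data: {"choices": [{"delta": {"content": "lo!"}}]}\n'
    "data: [DONE]\n"
)
text = parse_sse(stream)
# text == "Hello!"
```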
// Compress context using specific strategy
await contextManager.CompressContextAsync(
sessionId,
CompressionStrategy.SmartCompression);
// Get current context state
var state = await contextManager.GetContextStateAsync(sessionId);
Console.WriteLine($"Utilization: {state.UtilizationRatio:P}");

POST /v1/inference

{
"configId": "550e8400-e29b-41d4-a716-446655440000",
"messages": [{ "role": "user", "content": "Explain quantum computing" }],
"temperature": 0.3,
"maxTokens": 1024
}

Chat completion request reusing an existing session, with retrieval context injected:

{
"model": "llama3.2",
"messages": [{ "role": "user", "content": "What does this API do?" }],
"session_id": "existing-session-id",
"retrieval_context": [
{
"content": "The API provides LLM inference...",
"source": "Documentation",
"relevance_score": 0.95
}
]
}

- Default context window: 8192 tokens
- Reserve for output: 2048 tokens
- Reserve for system: 512 tokens
- Available for context: 5632 tokens (8192 - 2048 - 512)
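The budget left for conversation history is simply the window minus both reservations:

```python
context_window = 8192   # total token window
output_reserve = 2048   # reserved for the model's response
system_reserve = 512    # reserved for the system prompt

available = context_window - output_reserve - system_reserve
# 8192 - 2048 - 512 = 5632 tokens available for context
```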
- Sessions: IsActive, LastActivityAt, CreatedAt
- ContextMessages: SessionId + SequenceNumber, IsSummarized
- TechnicalDocuments: DocumentType, IsIndexed, Language
- DocumentChunks: TechnicalDocumentId, ChunkIndex
- Inference configurations cached in memory
- Session state optimized for frequent reads
- Document embeddings stored with JSONB
Database Connection Failed
Check PostgreSQL is running and connection string is correct.
Ensure database exists: CREATE DATABASE LocalInference;
Ollama Connection Failed
Verify Ollama is running: curl http://localhost:11434/api/tags
Check base URL in configuration
Token Budget Exceeded
Reduce context window size or enable compression
Check session statistics for utilization ratio
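The check implied here can be sketched as a simple utilization-ratio threshold; the 0.85 value below is an assumed example, not the service's actual default:

```python
def should_compress(total_tokens, context_window, threshold=0.85):
    """Flag a session for compression once utilization crosses the threshold."""
    utilization = total_tokens / context_window
    return utilization >= threshold

# A session holding 7200 tokens in an 8192-token window is ~88% full.
flag = should_compress(7200, 8192)
```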
# Enable debug logging
export Logging__LogLevel__Default=Debug
# View logs
dotnet run --verbosity diagnostic

src/
├── LocalInference.Domain/ # Domain entities
├── LocalInference.Application/ # Business logic
├── LocalInference.Infrastructure/ # Data access, providers
└── LocalInference.Api/ # HTTP API
- Implement the IInferenceProvider interface
- Register in InferenceProviderFactory
- Add configuration options
- Create HTTP client registration
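Since the README doesn't show the C# interface members, here is a rough Python analog of the provider abstraction and factory dispatch; the single `complete` method and the registry shape are assumptions for illustration:

```python
from abc import ABC, abstractmethod

class InferenceProvider(ABC):
    """Rough analog of IInferenceProvider: one entry point per backend."""

    @abstractmethod
    def complete(self, model: str, messages: list) -> str: ...

class EchoProvider(InferenceProvider):
    """Stand-in provider used here to exercise the dispatch path."""

    def complete(self, model, messages):
        return f"[{model}] " + messages[-1]["content"]

# Analog of InferenceProviderFactory: resolve a provider by its type string.
PROVIDERS = {"Echo": EchoProvider}

def create_provider(provider_type: str) -> InferenceProvider:
    try:
        return PROVIDERS[provider_type]()
    except KeyError:
        raise ValueError(f"Unknown provider type: {provider_type}")

reply = create_provider("Echo").complete("llama3.2", [{"role": "user", "content": "Hi"}])
# reply == "[llama3.2] Hi"
```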
dotnet test

MIT License - See LICENSE file for details.
- Fork the repository
- Create a feature branch
- Submit a pull request
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and API docs