LLM Proxy Architecture

This diagram illustrates the components and data flows of the LLM proxy system.

System Overview

```mermaid
flowchart TB
    Client[Client Applications] --> APIServer

    subgraph APIServer["API Server Container"]
        API[Express API] --> RequestHandler[Request Handler]
        ResponseStreamHandler[Response Stream Handler]
        RequestHandler --> Producer1[Producer]
        Consumer1[Consumer] --> ResponseStreamHandler
    end

    subgraph RabbitMQ["RabbitMQ Message Broker"]
        RequestExchange[(Request Exchange)]
        RequestQueue[(Request Queue)]
        ResponseExchange[(Response Exchange)]
        ServerQueues[(Server-Specific Queues)]
        AuditRequestQueue[(Request Audit Queue)]
        AuditResponseQueue[(Response Audit Queue)]

        RequestExchange --> RequestQueue
        RequestExchange --> AuditRequestQueue
        ResponseExchange --> ServerQueues
        ResponseExchange --> AuditResponseQueue
    end

    subgraph WorkerContainer["Worker Containers (Scalable)"]
        Worker1[LLM Worker 1]
        Worker2[LLM Worker 2]
        Worker3[LLM Worker 3]
    end

    subgraph AuditContainer["Audit Service Container"]
        AuditService[Audit Service]
        AuditConsumer[Consumer]
        AuditConsumer --> AuditService
    end

    subgraph LLMProviders["LLM Providers"]
        Ollama[Ollama API]
        Mock[Mock AiProvider]
    end

    subgraph Storage["Storage"]
        PostgreSQL[(PostgreSQL Database)]
    end

    Producer1 --> RequestExchange
    RequestQueue --> WorkerContainer
    Worker1 & Worker2 & Worker3 --> ResponseExchange
    Worker1 & Worker2 & Worker3 --> LLMProviders
    ServerQueues --> Consumer1
    AuditRequestQueue & AuditResponseQueue --> AuditConsumer
    AuditService --> PostgreSQL
    ResponseStreamHandler --> Client

    classDef container fill:#e9f7f2,stroke:#333,stroke-width:2px
    classDef queue fill:#ffe6cc,stroke:#333
    classDef service fill:#d5e8d4,stroke:#333
    classDef database fill:#f8cecc,stroke:#333
    classDef client fill:#dae8fc,stroke:#333

    class APIServer,WorkerContainer,AuditContainer container
    class RequestExchange,ResponseExchange,RequestQueue,ServerQueues,AuditRequestQueue,AuditResponseQueue queue
    class API,Worker1,Worker2,Worker3,AuditService,Ollama,Mock service
    class PostgreSQL database
    class Client client
```

The system uses a message queue architecture to decouple components and enable horizontal scaling:

  • API Layer: Handles client requests and initiates streaming responses
  • Request Queue: Buffers incoming requests for processing
  • LLM Workers: Process requests by calling LLM providers
  • Response Queues: Server-specific queues that route responses back to clients
  • Audit Queues: Capture request and response data for logging and analysis
  • LLM Providers: Abstractions for different LLM implementations (Ollama, etc.)
  • Audit Service: Logs data to PostgreSQL for metrics and monitoring
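To make the topology concrete, here is a minimal sketch of how the exchanges and queues above could be declared with amqplib. The exchange and queue names (llm.requests, llm.responses and their audit queues) and the exchange types are assumptions chosen for illustration, not the proxy's actual configuration; only the request exchange's fanout type is taken from the request-flow diagram below.

```typescript
// topology-sketch.ts -- illustrative only; names and exchange types are assumed.
import amqp from 'amqplib';

const RABBITMQ_URL = process.env.RABBITMQ_URL ?? 'amqp://localhost';

export async function declareTopology() {
  const connection = await amqp.connect(RABBITMQ_URL);
  const channel = await connection.createChannel();

  // Request side: a fanout exchange feeds both the shared worker queue
  // and the request audit queue, mirroring the diagram above.
  await channel.assertExchange('llm.requests', 'fanout', { durable: true });
  await channel.assertQueue('llm.requests.work', { durable: true });
  await channel.assertQueue('llm.requests.audit', { durable: true });
  await channel.bindQueue('llm.requests.work', 'llm.requests', '');
  await channel.bindQueue('llm.requests.audit', 'llm.requests', '');

  // Response side: a topic exchange routes chunks back to the API server
  // instance that owns the client connection (routing key = serverId),
  // while the audit queue is bound with '#' so it sees every response.
  await channel.assertExchange('llm.responses', 'topic', { durable: true });
  await channel.assertQueue('llm.responses.audit', { durable: true });
  await channel.bindQueue('llm.responses.audit', 'llm.responses', '#');

  return { connection, channel };
}
```

Each API server instance additionally declares its own server-specific response queue and binds it with its own identifier, which is what lets responses find their way back to the server holding the client connection.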

Request Flow

```mermaid
sequenceDiagram
    participant Client as Client
    participant API as API Server
    participant RQ as Request Queue
    participant Worker as LLM Worker
    participant LLM as LLM AiProvider
    participant RespQ as Response Queue
    participant AuditQ as Audit Queue

    Client->>API: POST /api/chat
    Note over API: Create requestId
    API-->>Client: Start SSE stream
    API->>RQ: Send request message
    Note over RQ: Fanout exchange
    RQ->>Worker: Consume request
    Worker->>LLM: Process with provider

    loop For each token
        LLM-->>Worker: Token generation
        Worker->>RespQ: Send token chunk
        Worker->>AuditQ: Send for logging
        RespQ-->>API: Stream to client
        API-->>Client: SSE event
    end

    LLM-->>Worker: Complete response
    Worker->>RespQ: Send final response
    Worker->>AuditQ: Send metrics
    RespQ-->>API: Final response
    API-->>Client: End SSE stream
```

The request flow demonstrates how data moves through the system:

  1. Client sends a request to the API
  2. API generates a request ID and creates a server-sent events (SSE) stream
  3. API sends the request to the request queue
  4. A worker picks up the request and processes it with the LLM provider
  5. As tokens are generated, they are sent to:
    • The response queue (routed to the specific server)
    • The audit queue (for logging)
  6. The API streams tokens back to the client in real time
  7. When complete, the worker sends the final response with metrics
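The API side of this flow (steps 1-3 and 6) might look roughly like the sketch below, assuming Express and amqplib and reusing the hypothetical exchange and queue names from the earlier topology sketch. The message shape (requestId, serverId, messages) and the done flag on the final chunk are also assumptions for illustration.

```typescript
// api-sketch.ts -- illustrative SSE endpoint; names and message shape are assumed.
import express, { Response } from 'express';
import { randomUUID } from 'node:crypto';
import amqp from 'amqplib';

const SERVER_ID = randomUUID();                 // identifies this API server instance
const pending = new Map<string, Response>();    // requestId -> open SSE stream

async function main() {
  const connection = await amqp.connect(process.env.RABBITMQ_URL ?? 'amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertExchange('llm.requests', 'fanout', { durable: true });
  await channel.assertExchange('llm.responses', 'topic', { durable: true });

  // Consume this server's own response queue and relay chunks to the right SSE stream.
  const responseQueue = `llm.responses.${SERVER_ID}`;
  await channel.assertQueue(responseQueue, { exclusive: true });
  await channel.bindQueue(responseQueue, 'llm.responses', SERVER_ID);
  await channel.consume(responseQueue, (msg) => {
    if (!msg) return;
    const chunk = JSON.parse(msg.content.toString());
    const res = pending.get(chunk.requestId);
    if (res) {
      res.write(`data: ${JSON.stringify(chunk)}\n\n`);  // one SSE event per chunk
      if (chunk.done) {
        res.end();                                       // close the stream on the final chunk
        pending.delete(chunk.requestId);
      }
    }
    channel.ack(msg);
  });

  const app = express();
  app.use(express.json());

  app.post('/api/chat', (req, res) => {
    const requestId = randomUUID();

    // Open the SSE stream before any work is done (steps 1-2).
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.flushHeaders();
    pending.set(requestId, res);

    // Publish the request so any available worker can pick it up (step 3).
    channel.publish(
      'llm.requests',
      '',
      Buffer.from(JSON.stringify({ requestId, serverId: SERVER_ID, messages: req.body.messages })),
      { persistent: true },
    );
  });

  app.listen(3000);
}

main().catch(console.error);
```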

AiProvider Interface Pattern

```mermaid
classDiagram
    class LLMProviderInterface {
        +init()
        +getModels()
        +generate(params, onToken, onComplete, onError)
        +chat(params, onToken, onComplete, onError)
    }
    class OllamaProvider {
        -config
        -ollama
        +init()
        +getModels()
        +generate()
        +chat()
    }
    class MockProvider {
        -config
        +init()
        +getModels()
        +generate()
        +chat()
    }
    class ProviderFactory {
        +getProvider(type)
        +getModels()
    }

    LLMProviderInterface <|-- OllamaProvider
    LLMProviderInterface <|-- MockProvider
    ProviderFactory --> LLMProviderInterface
```

The provider interface pattern enables support for multiple LLM implementations:

  • LLMProviderInterface: Defines the contract that all providers must implement
  • OllamaProvider: Implementation that calls the Ollama API
  • MockProvider: Provides mock responses for testing
  • ProviderFactory: Creates the appropriate provider based on configuration

This pattern allows easy addition of new LLM providers without changing the core system.
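In TypeScript terms, the contract might look roughly like the sketch below. The exact method signatures, the GenerateParams shape, and the factory's configuration key are assumptions inferred from the class diagram above, not the actual source.

```typescript
// provider-sketch.ts -- hypothetical shapes inferred from the class diagram above.
type TokenCallback = (token: string) => void;
type CompleteCallback = (fullText: string) => void;
type ErrorCallback = (err: Error) => void;

interface GenerateParams {
  model: string;
  prompt?: string;
  messages?: { role: string; content: string }[];
}

// The contract every provider implements.
interface LLMProviderInterface {
  init(): Promise<void>;
  getModels(): Promise<string[]>;
  generate(params: GenerateParams, onToken: TokenCallback, onComplete: CompleteCallback, onError: ErrorCallback): Promise<void>;
  chat(params: GenerateParams, onToken: TokenCallback, onComplete: CompleteCallback, onError: ErrorCallback): Promise<void>;
}

// MockProvider streams canned tokens so the pipeline can be exercised without a real LLM.
class MockProvider implements LLMProviderInterface {
  constructor(private config: { delayMs?: number } = {}) {}

  async init(): Promise<void> {}

  async getModels(): Promise<string[]> {
    return ['mock-model'];
  }

  async chat(params: GenerateParams, onToken: TokenCallback, onComplete: CompleteCallback, onError: ErrorCallback): Promise<void> {
    try {
      const tokens = ['This ', 'is ', 'a ', 'mock ', 'reply.'];
      for (const token of tokens) {
        await new Promise((resolve) => setTimeout(resolve, this.config.delayMs ?? 10));
        onToken(token);
      }
      onComplete(tokens.join(''));
    } catch (err) {
      onError(err as Error);
    }
  }

  // The mock treats single-prompt generation the same as chat.
  generate(params: GenerateParams, onToken: TokenCallback, onComplete: CompleteCallback, onError: ErrorCallback): Promise<void> {
    return this.chat(params, onToken, onComplete, onError);
  }
}

// ProviderFactory selects an implementation from configuration.
class ProviderFactory {
  static getProvider(type: string): LLMProviderInterface {
    switch (type) {
      case 'mock':
        return new MockProvider();
      // case 'ollama': return new OllamaProvider(config);  // omitted in this sketch
      default:
        throw new Error(`Unknown provider type: ${type}`);
    }
  }
}
```

A worker would then resolve its provider once at start-up, for example ProviderFactory.getProvider(process.env.LLM_PROVIDER ?? 'mock'), and pass the token callbacks that publish chunks to the response and audit queues.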

Monitoring and Audit

```mermaid
flowchart LR
    AuditQ[Audit Queues] --> AuditSvc[Audit Service]
    AuditSvc --> DB[(PostgreSQL)]
    DB --> MonitorSvc[Monitoring Service]
    MonitorSvc --> Dashboard[Dashboard UI]

    subgraph "Audit Data Flow"
        direction TB
        ReqAudit[Request Audit]
        RespAudit[Response Audit]
        Metrics[Performance Metrics]
        ModelStats[Model Statistics]
        WorkerStats[Worker Statistics]

        ReqAudit --> Metrics
        RespAudit --> Metrics
        Metrics --> ModelStats
        Metrics --> WorkerStats
    end

    AuditSvc --> ReqAudit
    AuditSvc --> RespAudit
    MonitorSvc --> Metrics

    style AuditQ fill:#ffe6cc,stroke:#333
    style AuditSvc fill:#fff2cc,stroke:#333
    style DB fill:#f8cecc,stroke:#333
    style MonitorSvc fill:#d5e8d4,stroke:#333
    style Dashboard fill:#d4f1f9,stroke:#333
```

The monitoring and audit subsystem collects performance data and provides visibility:

  • Audit Queues: Separate queues for request and response audit data
  • Audit Service: Consumes audit messages and stores them in PostgreSQL
  • Monitoring Service: Extracts metrics and statistics from the database
  • Dashboard UI: Visualizes performance metrics and system health
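A sketch of the audit consumer is shown below, assuming amqplib and the node-postgres (pg) client. The queue names reuse the hypothetical ones from the topology sketch, and the audit_events table and its columns are placeholders, not the service's real schema.

```typescript
// audit-sketch.ts -- illustrative audit consumer; queue names and schema are assumed.
import amqp from 'amqplib';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function main() {
  const connection = await amqp.connect(process.env.RABBITMQ_URL ?? 'amqp://localhost');
  const channel = await connection.createChannel();

  // Consume both audit queues and persist each event for later analysis.
  for (const queue of ['llm.requests.audit', 'llm.responses.audit']) {
    await channel.assertQueue(queue, { durable: true });
    await channel.consume(queue, async (msg) => {
      if (!msg) return;
      try {
        const event = JSON.parse(msg.content.toString());
        // Hypothetical table: audit_events(request_id, kind, model, payload, created_at)
        await pool.query(
          `INSERT INTO audit_events (request_id, kind, model, payload)
           VALUES ($1, $2, $3, $4)`,
          [event.requestId, event.kind ?? queue, event.model ?? null, event],
        );
        channel.ack(msg);
      } catch (err) {
        console.error('audit insert failed', err);
        channel.nack(msg, false, true);  // requeue so the event is not lost
      }
    });
  }
}

main().catch(console.error);
```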

Key metrics collected include:

  • Queue depth and processing throughput
  • Token generation speed (tokens per second)
  • Model usage statistics
  • Worker performance metrics
  • Request/response history for debugging
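For example, a tokens-per-second figure could be derived from the audited responses with a query along these lines, assuming the hypothetical audit_events table from the previous sketch records a token count and duration for each completed response.

```typescript
// metrics-sketch.ts -- illustrative query; table and column names are assumed.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Average generation speed per model over the last hour.
export async function tokensPerSecondByModel() {
  const { rows } = await pool.query(`
    SELECT model,
           COUNT(*) AS responses,
           AVG((payload->>'tokenCount')::numeric
               / NULLIF((payload->>'durationMs')::numeric / 1000, 0)) AS tokens_per_second
    FROM audit_events
    WHERE kind = 'response'
      AND created_at > NOW() - INTERVAL '1 hour'
    GROUP BY model
    ORDER BY tokens_per_second DESC
  `);
  return rows;
}
```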