# LLM Proxy Architecture
The diagrams below illustrate the components and data flows of the LLM proxy system.
## System Overview
flowchart TB
    Client[Client Applications] --> APIServer

    subgraph APIServer["API Server Container"]
        API[Express API] --> RequestHandler[Request Handler]
        ResponseStreamHandler[Response Stream Handler]
        RequestHandler --> Producer1[Producer]
        Consumer1[Consumer] --> ResponseStreamHandler
    end

    subgraph RabbitMQ["RabbitMQ Message Broker"]
        RequestExchange[(Request Exchange)]
        RequestQueue[(Request Queue)]
        ResponseExchange[(Response Exchange)]
        ServerQueues[(Server-Specific Queues)]
        AuditRequestQueue[(Request Audit Queue)]
        AuditResponseQueue[(Response Audit Queue)]
        RequestExchange --> RequestQueue
        RequestExchange --> AuditRequestQueue
        ResponseExchange --> ServerQueues
        ResponseExchange --> AuditResponseQueue
    end

    subgraph WorkerContainer["Worker Containers (Scalable)"]
        Worker1[LLM Worker 1]
        Worker2[LLM Worker 2]
        Worker3[LLM Worker 3]
    end

    subgraph AuditContainer["Audit Service Container"]
        AuditService[Audit Service]
        AuditConsumer[Consumer]
        AuditConsumer --> AuditService
    end

    subgraph LLMProviders["LLM Providers"]
        Ollama[Ollama API]
        Mock[Mock Provider]
    end

    subgraph Storage["Storage"]
        PostgreSQL[(PostgreSQL Database)]
    end

    Producer1 --> RequestExchange
    RequestQueue --> WorkerContainer
    Worker1 & Worker2 & Worker3 --> ResponseExchange
    Worker1 & Worker2 & Worker3 --> LLMProviders
    ServerQueues --> Consumer1
    AuditRequestQueue & AuditResponseQueue --> AuditConsumer
    AuditService --> PostgreSQL
    ResponseStreamHandler --> Client

    classDef container fill:#e9f7f2,stroke:#333,stroke-width:2px
    classDef queue fill:#ffe6cc,stroke:#333
    classDef service fill:#d5e8d4,stroke:#333
    classDef database fill:#f8cecc,stroke:#333
    classDef client fill:#dae8fc,stroke:#333

    class APIServer,WorkerContainer,AuditContainer container
    class RequestExchange,ResponseExchange,RequestQueue,ServerQueues,AuditRequestQueue,AuditResponseQueue queue
    class API,Worker1,Worker2,Worker3,AuditService,Ollama,Mock service
    class PostgreSQL database
    class Client client
The system uses a message queue architecture to decouple components and enable horizontal scaling:
- API Layer: Handles client requests and initiates streaming responses (see the endpoint sketch after this list)
- Request Queue: Buffers incoming requests for processing
- LLM Workers: Process requests by calling LLM providers
- Response Queues: Server-specific queues that route responses back to clients
- Audit Queues: Capture request and response data for logging and analysis
- LLM Providers: Abstractions for different LLM implementations (Ollama, etc.)
- Audit Service: Logs data to PostgreSQL for metrics and monitoring
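
To make the API layer concrete, here is a minimal sketch of an Express endpoint that creates a request ID, opens the SSE stream, and publishes the request to the fanout exchange via amqplib. The exchange name, server identifier, and message shape are illustrative assumptions, not the project's actual configuration.

```typescript
import express from "express";
import amqp from "amqplib";
import { randomUUID } from "node:crypto";

// Hypothetical names; the real exchange and server identifiers are configuration-driven.
const REQUEST_EXCHANGE = "llm.requests";
const SERVER_ID = randomUUID(); // identifies this server's response queue

const conn = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
const channel = await conn.createChannel();
await channel.assertExchange(REQUEST_EXCHANGE, "fanout", { durable: true });

const app = express();
app.use(express.json());

app.post("/api/chat", (req, res) => {
  const requestId = randomUUID();

  // Open the SSE stream immediately; tokens arrive asynchronously later.
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders();

  // Fanout publish: both the request queue and the request audit queue
  // are bound to this exchange, so one publish reaches both.
  channel.publish(
    REQUEST_EXCHANGE,
    "",
    Buffer.from(JSON.stringify({ requestId, serverId: SERVER_ID, body: req.body })),
    { persistent: true },
  );
  // The response stream handler (not shown) matches incoming chunks by
  // requestId and writes them to `res` as SSE events.
});

app.listen(3000);
```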
## Request Flow
sequenceDiagram
    participant Client as Client
    participant API as API Server
    participant RQ as Request Queue
    participant Worker as LLM Worker
    participant LLM as LLM Provider
    participant RespQ as Response Queue
    participant AuditQ as Audit Queue

    Client->>API: POST /api/chat
    Note over API: Create requestId
    API-->>Client: Start SSE stream
    API->>RQ: Send request message
    Note over RQ: Fanout exchange
    RQ->>Worker: Consume request
    Worker->>LLM: Process with provider

    loop For each token
        LLM-->>Worker: Token generation
        Worker->>RespQ: Send token chunk
        Worker->>AuditQ: Send for logging
        RespQ-->>API: Stream to client
        API-->>Client: SSE event
    end

    LLM-->>Worker: Complete response
    Worker->>RespQ: Send final response
    Worker->>AuditQ: Send metrics
    RespQ-->>API: Final response
    API-->>Client: End SSE stream
The request flow demonstrates how data moves through the system (a worker-side sketch follows the list):
- The client sends a request to the API
- The API generates a request ID and opens a server-sent events (SSE) stream
- The API publishes the request to the request queue
- A worker picks up the request and processes it with the LLM provider
- As tokens are generated, they are sent to:
  - the response queue (routed back to the originating server)
  - the audit queue (for logging)
- The API streams tokens back to the client in real time
- When generation completes, the worker sends the final response along with metrics
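
On the other side of the queue, a worker consumes one request at a time, streams tokens from the provider, and publishes each chunk to the response exchange keyed by the originating server. This is a sketch under assumptions: the exchange type (direct, routed by serverId), the queue names, and the message fields are hypothetical; the LLMProvider import refers to the interface sketched in the next section, and the parallel publish to the audit queue is omitted for brevity.

```typescript
import amqp from "amqplib";
import type { LLMProvider } from "./provider"; // hypothetical module, sketched below

// Hypothetical names; the real topology is configuration-driven.
const REQUEST_QUEUE = "llm.requests.queue";
const RESPONSE_EXCHANGE = "llm.responses";

export async function startWorker(provider: LLMProvider) {
  const conn = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue(REQUEST_QUEUE, { durable: true });
  await ch.assertExchange(RESPONSE_EXCHANGE, "direct", { durable: true });
  ch.prefetch(1); // one in-flight request per worker

  ch.consume(REQUEST_QUEUE, (msg) => {
    if (!msg) return;
    const { requestId, serverId, body } = JSON.parse(msg.content.toString());

    provider.chat(
      body,
      // Token chunk: routed to the server-specific queue via the serverId key.
      (token) =>
        ch.publish(RESPONSE_EXCHANGE, serverId,
          Buffer.from(JSON.stringify({ requestId, token, done: false }))),
      // Final response, including metrics for the audit trail.
      (result) => {
        ch.publish(RESPONSE_EXCHANGE, serverId,
          Buffer.from(JSON.stringify({ requestId, result, done: true })));
        ch.ack(msg);
      },
      // On error, dead-letter the message instead of requeueing it forever.
      (err) => {
        console.error(`request ${requestId} failed:`, err);
        ch.nack(msg, false, false);
      },
    );
  });
}
```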
## Provider Interface Pattern
classDiagram
    class LLMProviderInterface {
        +init()
        +getModels()
        +generate(params, onToken, onComplete, onError)
        +chat(params, onToken, onComplete, onError)
    }
    class OllamaProvider {
        -config
        -ollama
        +init()
        +getModels()
        +generate()
        +chat()
    }
    class MockProvider {
        -config
        +init()
        +getModels()
        +generate()
        +chat()
    }
    class ProviderFactory {
        +getProvider(type)
        +getModels()
    }

    LLMProviderInterface <|-- OllamaProvider
    LLMProviderInterface <|-- MockProvider
    ProviderFactory --> LLMProviderInterface
The provider interface pattern enables support for multiple LLM implementations:
- LLMProviderInterface: Defines the contract that all providers must implement
- OllamaProvider: Implementation for Ollama LLM
- MockProvider: Provides mock responses for testing
- ProviderFactory: Creates the appropriate provider based on configuration
This pattern allows new LLM providers to be added without changing the core system; a sketch of the contract in TypeScript follows.
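
A TypeScript rendering of this contract might look as follows. The signatures mirror the class diagram above, but the parameter types and the mock's canned output are assumptions; this is a sketch, not the project's actual code.

```typescript
export interface ChatParams {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Contract from the class diagram; every provider implements these four methods.
export interface LLMProvider {
  init(): Promise<void>;
  getModels(): Promise<string[]>;
  generate(params: ChatParams, onToken: (token: string) => void,
           onComplete: (result: unknown) => void, onError: (err: Error) => void): void;
  chat(params: ChatParams, onToken: (token: string) => void,
       onComplete: (result: unknown) => void, onError: (err: Error) => void): void;
}

// Mock implementation: emits a canned response token by token, useful for
// testing the queue plumbing without a running LLM.
export class MockProvider implements LLMProvider {
  async init(): Promise<void> {}
  async getModels(): Promise<string[]> { return ["mock-model"]; }
  generate(params: ChatParams, onToken: (token: string) => void,
           onComplete: (result: unknown) => void,
           _onError: (err: Error) => void): void {
    const tokens = ["Hello", " from", " the", " mock", " provider"];
    tokens.forEach(onToken);
    onComplete({ model: params.model, tokenCount: tokens.length });
  }
  chat(params: ChatParams, onToken: (token: string) => void,
       onComplete: (result: unknown) => void, onError: (err: Error) => void): void {
    this.generate(params, onToken, onComplete, onError);
  }
}

// Factory: selects a provider from configuration. An OllamaProvider wrapping
// the Ollama HTTP API would be registered here alongside the mock.
export function getProvider(type: string): LLMProvider {
  if (type === "mock") return new MockProvider();
  throw new Error(`Unknown provider type: ${type}`);
}
```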
## Monitoring and Audit
flowchart LR
    AuditQ[Audit Queues] --> AuditSvc[Audit Service]
    AuditSvc --> DB[(PostgreSQL)]
    DB --> MonitorSvc[Monitoring Service]
    MonitorSvc --> Dashboard[Dashboard UI]

    subgraph "Audit Data Flow"
        direction TB
        ReqAudit[Request Audit]
        RespAudit[Response Audit]
        Metrics[Performance Metrics]
        ModelStats[Model Statistics]
        WorkerStats[Worker Statistics]
        ReqAudit --> Metrics
        RespAudit --> Metrics
        Metrics --> ModelStats
        Metrics --> WorkerStats
    end

    AuditSvc --> ReqAudit
    AuditSvc --> RespAudit
    MonitorSvc --> Metrics

    style AuditQ fill:#ffe6cc,stroke:#333
    style AuditSvc fill:#fff2cc,stroke:#333
    style DB fill:#f8cecc,stroke:#333
    style MonitorSvc fill:#d5e8d4,stroke:#333
    style Dashboard fill:#d4f1f9,stroke:#333
The monitoring and audit subsystem collects performance data and provides visibility:
- Audit Queues: Separate queues for request and response audit data
- Audit Service: Consumes audit messages and stores them in PostgreSQL (see the consumer sketch after this list)
- Monitoring Service: Extracts metrics and statistics from the database
- Dashboard UI: Visualizes performance metrics and system health
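
As a sketch of the audit path, the service below consumes both audit queues with amqplib and appends each message to a single PostgreSQL table via node-postgres. The queue names and the audit_events schema are hypothetical; the real schema belongs to the audit service.

```typescript
import amqp from "amqplib";
import { Pool } from "pg";

// Hypothetical queue names and table; adjust to the deployed topology.
const AUDIT_QUEUES = ["audit.requests", "audit.responses"];
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function startAuditService() {
  const conn = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
  const ch = await conn.createChannel();

  for (const queue of AUDIT_QUEUES) {
    await ch.assertQueue(queue, { durable: true });
    ch.consume(queue, async (msg) => {
      if (!msg) return;
      const event = JSON.parse(msg.content.toString());
      // Append-only log: one row per audit message; metrics are derived later.
      await pool.query(
        `INSERT INTO audit_events (request_id, source_queue, payload, created_at)
         VALUES ($1, $2, $3, NOW())`,
        [event.requestId, queue, event],
      );
      ch.ack(msg); // acknowledge only after the row is written
    });
  }
}
```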
Key metrics collected include (a sample metrics query follows the list):
- Queue depth and processing throughput
- Token generation speed (tokens per second)
- Model usage statistics
- Worker performance metrics
- Request/response history for debugging
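
Assuming the hypothetical audit_events table from the sketch above, a metric such as token generation speed per model could be derived with a query along these lines (the tokenCount and durationMs fields inside payload are also assumptions):

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Tokens per second per model, derived from the final response audit messages.
// payload->>'tokenCount' and payload->>'durationMs' are assumed field names.
export async function tokensPerSecondByModel() {
  const { rows } = await pool.query(`
    SELECT payload->>'model' AS model,
           SUM((payload->>'tokenCount')::numeric)
             / NULLIF(SUM((payload->>'durationMs')::numeric) / 1000.0, 0) AS tokens_per_second
    FROM audit_events
    WHERE source_queue = 'audit.responses'
    GROUP BY payload->>'model'
  `);
  return rows;
}
```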