TL;DR
A $40/month Digital Ocean droplet (4 vCPU, 8 GB RAM, 160 GB SSD) is enough to run a complete, production-ready AI stack: Qdrant as the vector DB, Langfuse for LLM observability, LiteLLM Proxy as a unified gateway, Redis Stack for semantic caching and agent memory, Ollama for local models, Nginx + Certbot for TLS, Prometheus + Grafana for metrics, and Restic for backups to Spaces. This article publishes the full docker-compose.yml, the per-container resource budget, the internal network layout, and the recovery plan if the droplet goes down.
If your client cannot or will not send data to OpenAI and your monthly AI infrastructure budget is under $100 USD, this is the starting point.
Why self-hosting still makes sense in 2026
The dominant narrative is that "everything is in managed cloud." In practice, three forces push toward self-hosting:
- Regulatory data residency. The EU AI Act (enforceable from August 2, 2026) and sectoral frameworks for healthcare and finance require certain data to remain within a specific jurisdiction. Sending it to a hyperscaler API typically means signing heavy DPAs and annual audits.
- Marginal inference costs. An agent flow with 50 LLM calls per session and a 45% semantic cache hit rate cuts the cost per session nearly in half when Redis lives on the same server as the orchestrator. Intra-region network latency also drops from 30-80 ms to under 2 ms.
- Vendor lock-in. Langfuse, Qdrant, LiteLLM, Redis, Ollama and Mastra are all open-source or source-available components you can move between providers without rewriting code.
What this article does not solve: training or serving large LLMs (>13B parameters) with low latency — for that, managed APIs or dedicated GPUs remain better.
Architecture
┌──────────────────────────────────────────────┐
│ Droplet s-4vcpu-8gb ($40/month) │
│ │
Client (HTTPS) ───► Nginx ──► [ LiteLLM Proxy :4000 ] │
│ │ │
│ ├──► Anthropic / OpenAI / Gemini │
│ │ (rate limit + fallback) │
│ │ │
│ └──► Ollama :11434 (Llama 3.1 8B) │
│ │
│ [ Qdrant :6333 ] [ Redis :6379 ] │
│ │
│ [ Langfuse Web :3000 ] │
│ └──► Langfuse Worker │
│ └──► Postgres :5432 │
│ └──► ClickHouse :8123 │
│ │
│ [ Prometheus :9090 ] [ Grafana :3001 ] │
│ │
│ Restic daemon ──► DO Spaces (S3) │
└──────────────────────────────────────────────┘
Internal network on the Docker Compose bridge; only Nginx and SSH are exposed to the outside world.
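To enforce that exposure model at the host level, here is a minimal firewall sketch (assuming a stock Ubuntu droplet with ufw; only Nginx publishes ports in the Compose file, so nothing else needs opening):

```bash
# Allow only SSH and the Nginx-published ports; everything else stays on the Docker bridge
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp    # SSH
ufw allow 80/tcp    # HTTP (ACME challenges + redirect)
ufw allow 443/tcp   # HTTPS
ufw enable
```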
The complete docker-compose.yml
This file lives at /opt/numoru-ai/docker-compose.yml. All credentials are read from .env.
version: "3.9"
networks:
core:
driver: bridge
volumes:
qdrant_data:
redis_data:
ollama_data:
lf_postgres:
lf_clickhouse:
lf_minio:
prometheus_data:
grafana_data:
services:
# --- Reverse proxy ---
nginx:
image: nginx:1.27-alpine
restart: unless-stopped
ports: ["80:80", "443:443"]
volumes:
- ./nginx/conf.d:/etc/nginx/conf.d:ro
- ./certs:/etc/letsencrypt:ro
networks: [core]
depends_on: [litellm, langfuse-web, grafana]
# --- Vector database ---
qdrant:
image: qdrant/qdrant:v1.12.5
restart: unless-stopped
volumes: [qdrant_data:/qdrant/storage]
environment:
QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: 2
networks: [core]
deploy:
resources:
limits: { memory: 2g, cpus: "1.5" }
# --- Semantic cache + working memory ---
redis:
image: redis/redis-stack-server:7.4.0-v1
restart: unless-stopped
command: >
redis-stack-server
--requirepass ${REDIS_PASSWORD}
--maxmemory 1gb
--maxmemory-policy allkeys-lru
volumes: [redis_data:/data]
networks: [core]
deploy:
resources:
limits: { memory: 1200m, cpus: "0.75" }
# --- Unified LLM gateway ---
litellm:
image: ghcr.io/berriai/litellm:main-stable
restart: unless-stopped
environment:
LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/litellm
LANGFUSE_PUBLIC_KEY: ${LF_PUBLIC_KEY}
LANGFUSE_SECRET_KEY: ${LF_SECRET_KEY}
LANGFUSE_HOST: http://langfuse-web:3000
volumes: [./litellm/config.yaml:/app/config.yaml:ro]
command: ["--config", "/app/config.yaml", "--port", "4000"]
networks: [core]
depends_on: [langfuse-web, redis]
deploy:
resources:
limits: { memory: 512m, cpus: "0.5" }
# --- Local models ---
ollama:
image: ollama/ollama:0.5.4
restart: unless-stopped
volumes: [ollama_data:/root/.ollama]
networks: [core]
deploy:
resources:
limits: { memory: 5g, cpus: "2.5" }
# --- Langfuse (observability) ---
langfuse-db:
image: postgres:16-alpine
restart: unless-stopped
environment:
POSTGRES_USER: lf
POSTGRES_PASSWORD: ${LF_DB_PASSWORD}
POSTGRES_DB: langfuse
volumes: [lf_postgres:/var/lib/postgresql/data]
networks: [core]
langfuse-clickhouse:
image: clickhouse/clickhouse-server:24.8
restart: unless-stopped
environment:
CLICKHOUSE_USER: lf
CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
CLICKHOUSE_DB: langfuse
volumes: [lf_clickhouse:/var/lib/clickhouse]
networks: [core]
langfuse-web:
image: langfuse/langfuse:3
restart: unless-stopped
environment:
DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/langfuse
CLICKHOUSE_URL: http://langfuse-clickhouse:8123
CLICKHOUSE_USER: lf
CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
REDIS_CONNECTION_STRING: redis://:${REDIS_PASSWORD}@redis:6379/1
NEXTAUTH_URL: https://langfuse.${DOMAIN}
NEXTAUTH_SECRET: ${LF_NEXTAUTH_SECRET}
SALT: ${LF_SALT}
ENCRYPTION_KEY: ${LF_ENCRYPTION_KEY}
depends_on: [langfuse-db, langfuse-clickhouse, redis]
networks: [core]
langfuse-worker:
image: langfuse/langfuse-worker:3
restart: unless-stopped
environment:
DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/langfuse
CLICKHOUSE_URL: http://langfuse-clickhouse:8123
CLICKHOUSE_USER: lf
CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
REDIS_CONNECTION_STRING: redis://:${REDIS_PASSWORD}@redis:6379/1
SALT: ${LF_SALT}
ENCRYPTION_KEY: ${LF_ENCRYPTION_KEY}
depends_on: [langfuse-web]
networks: [core]
# --- Metrics ---
prometheus:
image: prom/prometheus:v2.55.1
restart: unless-stopped
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- prometheus_data:/prometheus
networks: [core]
grafana:
image: grafana/grafana:11.3.1
restart: unless-stopped
environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_SERVER_HTTP_PORT: "3001"  # serve on :3001, matching the diagram and the Nginx mapping below
volumes: [grafana_data:/var/lib/grafana]
networks: [core]
# --- Backups ---
restic:
image: mazzolino/restic:1.7.3
restart: unless-stopped
environment:
RUN_ON_STARTUP: "false"
BACKUP_CRON: "0 4 * * *"
RESTIC_REPOSITORY: s3:${SPACES_ENDPOINT}/${SPACES_BUCKET}/restic
RESTIC_PASSWORD: ${RESTIC_PASSWORD}
AWS_ACCESS_KEY_ID: ${SPACES_KEY}
AWS_SECRET_ACCESS_KEY: ${SPACES_SECRET}
RESTIC_FORGET_ARGS: "--keep-daily 7 --keep-weekly 4 --keep-monthly 6"
volumes:
- qdrant_data:/mnt/qdrant:ro
- lf_postgres:/mnt/lf_postgres:ro
- lf_clickhouse:/mnt/lf_clickhouse:ro
- redis_data:/mnt/redis:ro
networks: [core]
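Two details are worth handling before first boot: LiteLLM points at a separate litellm database on the shared Postgres instance (POSTGRES_DB only creates langfuse), and Ollama needs the model pulled once before it can serve requests. A bring-up sketch, assuming the repo sits at /opt/numoru-ai and .env is filled in:

```bash
cd /opt/numoru-ai

# Generate one value per secret in .env (LITELLM_MASTER_KEY, LF_NEXTAUTH_SECRET,
# LF_SALT, LF_ENCRYPTION_KEY, RESTIC_PASSWORD, ...).
openssl rand -hex 32

# Create the "litellm" database that the LiteLLM service expects.
docker compose up -d langfuse-db
until docker compose exec langfuse-db pg_isready -U lf >/dev/null 2>&1; do sleep 2; done
docker compose exec langfuse-db createdb -U lf litellm

# Start everything and pull the local model referenced in litellm/config.yaml.
docker compose up -d
docker compose exec ollama ollama pull llama3.1:8b-instruct-q4_K_M
```

Langfuse v3 also expects an S3-compatible blob store for event uploads (that is what the lf_minio volume is declared for); either add a small MinIO container or point Langfuse's storage variables at DO Spaces, following its self-hosting docs.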
Memory budget
The s-4vcpu-8gb droplet leaves roughly 7.5 GB usable once the kernel and base system are accounted for. Memory budget:
| Service | Limit | Justification |
|---|---|---|
| Ollama | 5 GB | Llama 3.1 8B Q4_K_M fits in ~4.8 GB |
| Qdrant | 2 GB | ~10M 768-dim vectors with scalar quantization and on-disk storage |
| Redis | 1.2 GB | semantic cache + agent state |
| ClickHouse | 1 GB | Langfuse observability |
| Postgres Langfuse | 512 MB | metadata |
| Langfuse web + worker | 1.25 GB | |
| Nginx + Prometheus + Grafana | 400 MB | |
| Total committed | ~11.3 GB |
Important: the limits add up to more than physical RAM because they overlap in time: Ollama only consumes its 5 GB while it is actively generating. If your load is mostly agents using Claude/GPT through LiteLLM and rarely Ollama, the actual working set stays under 6 GB. If your client needs Ollama as the primary model, upgrade to s-4vcpu-16gb ($96/month).
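To see how the budget holds up in practice, check the live working set per container directly on the droplet:

```bash
# Live memory and CPU per container, to compare against the table above
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"
```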
LiteLLM Proxy configuration
File /opt/numoru-ai/litellm/config.yaml:
model_list:
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: claude-opus
litellm_params:
model: anthropic/claude-opus-4-7
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
- model_name: llama-local
litellm_params:
      model: ollama/llama3.1:8b-instruct-q4_K_M
api_base: http://ollama:11434
litellm_settings:
cache: true
cache_params:
type: redis-semantic
host: redis
port: 6379
password: os.environ/REDIS_PASSWORD
similarity_threshold: 0.92
ttl: 86400
success_callback: ["langfuse"]
failure_callback: ["langfuse"]
router_settings:
routing_strategy: latency-based-routing
fallbacks:
- claude-sonnet: [gpt-4o, llama-local]
- gpt-4o: [claude-sonnet, llama-local]
general_settings:
master_key: os.environ/LITELLM_MASTER_KEY
database_url: os.environ/DATABASE_URL
This file enables three critical things: semantic cache (45-60% hit rate in customer-support agents), automatic fallback (if Anthropic goes down, traffic shifts to OpenAI or the local model) and Langfuse traces without instrumenting each client.
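A quick smoke test of the gateway, assuming the api.yourdomain.com endpoint from the next section is live and LITELLM_MASTER_KEY is exported in your shell (in production you would issue per-team virtual keys instead of using the master key):

```bash
curl https://api.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "claude-sonnet",
        "messages": [{"role": "user", "content": "ping"}]
      }'
```

Repeating the same request within the 24 h TTL should come back from the Redis semantic cache and show up in Langfuse with near-zero cost.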
Nginx with automatic TLS
Recommended subdomains:
- api.yourdomain.com → LiteLLM Proxy (:4000)
- langfuse.yourdomain.com → Langfuse web (:3000)
- grafana.yourdomain.com → Grafana (:3001)
- qdrant.yourdomain.com → Qdrant HTTP (:6333), protected with basic auth in addition to the API key
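The basic auth in front of the Qdrant subdomain is a one-time setup. A sketch, with the htpasswd path and user name as examples, referenced from an auth_basic_user_file directive in the qdrant server block (not shown here):

```bash
# Create the credentials file for the qdrant.yourdomain.com server block
apt-get install -y apache2-utils
htpasswd -c /opt/numoru-ai/nginx/conf.d/qdrant.htpasswd qdrant-admin
```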
File /opt/numoru-ai/nginx/conf.d/api.conf:
server {
    listen 443 ssl;
    http2 on;
    server_name api.yourdomain.com;
    ssl_certificate /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;
    client_max_body_size 25m;
    location / {
        proxy_pass http://litellm:4000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_buffering off;        # do not buffer streamed (SSE) completions
        proxy_read_timeout 300s;    # long generations can hold the connection open for minutes
    }
}
Certificate renewal with a cron job that runs certbot renew --webroot every 3 days.
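A concrete version of that job, as a sketch: it assumes certbot is installed on the host and that a port-80 server block (not shown) serves /.well-known/acme-challenge/ from /var/www/certbot:

```bash
# Install the renewal job; the deploy hook reloads Nginx inside the container
# so renewed certificates are picked up without downtime.
cat > /etc/cron.d/certbot-renew <<'EOF'
0 3 */3 * * root certbot renew --webroot -w /var/www/certbot --deploy-hook "docker compose -f /opt/numoru-ai/docker-compose.yml exec nginx nginx -s reload"
EOF
```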
Backup and restore
restic runs at 4am every day and backs up every volume to Digital Ocean Spaces. Retention policy: 7 daily, 4 weekly, 6 monthly.
Tested restore: on a fresh droplet, git clone of the infrastructure repo + .env + restic restore latest --target / regenerates the entire stack in under 15 minutes. Recovery time objective (RTO): 20 minutes. Recovery point objective (RPO): 24 hours.
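A restore you have never tested is not a backup, so it is worth checking the repo periodically from the same container (the mazzolino/restic image ships the restic binary with the repository environment already set):

```bash
# Confirm the latest snapshot exists and spot-check 5% of the repository data
docker compose exec restic restic snapshots --latest 1
docker compose exec restic restic check --read-data-subset=5%
```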
Observability: what Langfuse + Grafana give you
- Langfuse: every LLM call with input, output, cost, latency and user. Programmable evals. Versioned prompt management.
- Provisioned Grafana dashboards:
- Tokens/min per model
- Redis cache hit rate (target >40%)
- p50/p95/p99 latency per LiteLLM endpoint
- Qdrant disk usage
- Pending Langfuse worker queues
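The cache hit rate panel above is derived from Redis keyspace counters; a quick way to spot-check it from the droplet (with REDIS_PASSWORD exported from .env):

```bash
# Hit rate = keyspace_hits / (keyspace_hits + keyspace_misses)
docker compose exec redis redis-cli -a "$REDIS_PASSWORD" INFO stats | grep -E 'keyspace_(hits|misses)'
```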
Real costs (April 2026)
| Item | USD/month |
|---|---|
| s-4vcpu-8gb droplet | 40 |
| Spaces (50 GB + medium egress) | 5 |
| Domain + certs | 1 |
| Anthropic/OpenAI (passthrough) | variable |
| Base infrastructure | 46 |
With this stack, a client previously paying $800/month for SaaS RAG + observability + gateway typically drops to $46/month plus LLM usage, which itself falls by roughly 45% thanks to the semantic cache. ROI kicks in from the second week.
Entry/standard tier list price of each managed equivalent vs the unified $46/month self-hosted stack. Use the full chart to justify the migration to a CFO. (Chart series: Managed SaaS in USD vs self-hosted in USD; source: public pricing pages of Pinecone, LangSmith, OpenRouter and Upstash, February 2026.)
Business & commercial impact
What Numoru sells around this stack
The free docker-compose.yml is a marketing asset. The revenue comes from two productized services: a fixed-price installation ($3.5k-$8k) that stands the stack up in a customer's DO account, and a managed operations retainer ($450-$1,200/month) that covers upgrades, backups and incident response. Both are margin-heavy because the underlying infrastructure costs $46/month.
Who buys self-hosted AI infra
Stack install + managed ops pricing by buyer (Numoru, 2026)
Public benchmarks supporting the pitch
- Qdrant — performance vs hosted alternatives
- Langfuse — self-hosted adoption
- Digital Ocean — droplet sizing for AI workloads
Illustrative case — agency migrating 7 clients off managed SaaS
Numoru partner agency migrating 7 mid-market clients to shared stack
ROI calculator — migrate off managed SaaS
Single mid-market client: managed SaaS vs Numoru self-hosted (12 months)
| Line item | 12-month impact (USD) |
|---|---|
| Install (one-time) | −$6,500 |
| Managed SaaS avoided (12 mo × $820) | +$9,840 |
| LLM semantic-cache savings (12 mo × $1,530) | +$18,360 |
| Droplet + Spaces + TLS (12 mo × $46) | −$552 |
| Numoru ops retainer (12 mo × $450) | −$5,400 |
| Engineering time saved (6 h × $95 × 12) | +$6,840 |
| Net year-1 contribution | +$22,588 |
Pricing tiers Numoru sells
Install (one-time, $3.5k-$8k):
- Runs in your DO account
- Compose with Qdrant + Langfuse + LiteLLM + Redis
- TLS + Nginx + backups
- 30-day warranty
- Runbook PDF
- Stripe / card or PO
Managed ops (retainer, from $450/month):
- Everything in Install
- 24 / 7 monitoring (Grafana + alerts)
- Patching + version upgrades
- Monthly incident review
- Slack shared channel
- Quarterly cost audit
A third tier for regulated deployments adds:
- EU-region deployment (FRA / AMS / PAR)
- HIPAA-equivalent controls
- AI Act technical-docs bundle
- SAML / SSO
- Pen-test coordination
- Annual compliance attestation
Ops retainer scales with droplet count — multi-client agencies get 20% off from 5 droplets, 30% off from 10.
FAQ
Does this stack handle real production traffic?
Yes, for sustained loads up to ~30 queries per second and 500k LLM calls per day. Beyond that, split Qdrant and Langfuse onto their own droplets.
Can I run Claude or GPT locally?
No, they are closed models. Ollama + Llama 3.1 8B / Qwen 2.5 7B is the local route. For sensitive cases, mix the two: generic prompts go to Claude via LiteLLM, prompts containing PII go to the local Llama.
Why not Kubernetes?
A single droplet with Docker Compose is cheaper, easier for an independent consultant to operate, and sufficient for up to 500 concurrent users. Kubernetes starts to make sense from three production nodes onward.
Is it AI Act compliant?
The stack meets the technical requirements for transparency, log retention and data residency. Formal compliance also requires DPIA documentation and governance; that is a service, not infrastructure.
Can I replace Qdrant with pgvector?
For under 100k vectors and low QPS, yes. From 1M vectors and hybrid search in production, Qdrant clearly wins on latency (p95 <50 ms vs 300 ms).
Next steps
The complete repository with docker-compose.yml, Terraform to create the droplet and incident runbooks is published at github.com/numoru-ia/ai-stack-do. The next article in this series covers the tiered memory pattern that uses Redis + Langfuse + Mem0 inside this stack.