Engineering · self-hosted · qdrant · langfuse

Self-hosted AI stack on a $40 Digital Ocean droplet: Qdrant + Langfuse + LiteLLM + Redis

Production Docker Compose with Qdrant, Langfuse, LiteLLM Proxy, Redis 8 and Ollama on a $40 droplet. Nginx + Certbot, backups to Spaces, monitoring with Grafana and Prometheus.

Numoru Engineering · Published April 26, 2026 · 18 min read
Implementation proposal: github.com/numoru-ia/ai-stack-do

TL;DR

A $40/month Digital Ocean droplet (4 vCPU, 8 GB RAM, 160 GB SSD) is enough to run a complete, production-ready AI stack: Qdrant as the vector DB, Langfuse for LLM observability, LiteLLM Proxy as a unified gateway, Redis 8 for semantic cache and agent memory, Ollama for local models, Nginx + Certbot for TLS, Prometheus + Grafana for metrics, and Restic for backups to Spaces. This article publishes the full docker-compose.yml, the per-container resource budget, the internal network setup and the recovery plan if the droplet goes down.

If your client cannot or will not send data to OpenAI and your monthly AI infrastructure budget is under $100 USD, this is the starting point.

  • $46/month base infrastructure (droplet + Spaces + TLS)
  • ~94% cost reduction vs SaaS equivalents (comparable feature set)
  • −45% LLM cost via the Redis semantic cache (typical hit rate)
  • 30 qps sustained load on a single 4 vCPU droplet

Why self-hosting still makes sense in 2026

The dominant narrative is that "everything is in managed cloud." In practice, three forces push toward self-hosting:

  1. Regulatory data residency. The EU AI Act (enforceable from August 2, 2026) and sectoral frameworks for healthcare and finance require certain data to remain within a specific jurisdiction. Sending it to a hyperscaler API typically means signing heavy DPAs and annual audits.
  2. Marginal inference cost. An agent flow with 50 LLM calls per session and a 45% semantic cache hit rate pays for only ~55% of those calls, roughly halving cost per session when Redis lives on the same server as the orchestrator (see the cost sketch after this list). Latency to the cache also drops from 30-80 ms intra-region to <2 ms on loopback.
  3. Vendor lock-in. Langfuse, Qdrant, LiteLLM, Redis, Ollama and Mastra are all Apache 2.0 or MIT pieces you can move between providers without rewriting code.
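
A back-of-envelope version of that cache math, in Python; the per-call price is an illustrative assumption, not a benchmark:

# Cost per agent session with and without a semantic cache.
# COST_PER_CALL_USD is a hypothetical blended price, not a measured value.
CALLS_PER_SESSION = 50
HIT_RATE = 0.45            # typical hit rate cited above
COST_PER_CALL_USD = 0.004  # illustrative assumption

uncached = CALLS_PER_SESSION * COST_PER_CALL_USD
cached = CALLS_PER_SESSION * (1 - HIT_RATE) * COST_PER_CALL_USD
print(f"uncached: ${uncached:.3f}/session, cached: ${cached:.3f}/session")
# uncached: $0.200/session, cached: $0.110/session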

What this article does not solve: training or serving large LLMs (>13B parameters) with low latency — for that, managed APIs or dedicated GPUs remain better.

Architecture

                         ┌──────────────────────────────────────────────┐
                         │   Droplet s-4vcpu-8gb ($40/month)            │
                         │                                              │
  Client (HTTPS) ───► Nginx ──► [ LiteLLM Proxy   :4000 ]               │
                         │         │                                    │
                         │         ├──► Anthropic / OpenAI / Gemini     │
                         │         │   (rate limit + fallback)          │
                         │         │                                    │
                         │         └──► Ollama :11434 (Llama 3.3 8B)    │
                         │                                              │
                         │      [ Qdrant :6333 ]  [ Redis :6379 ]       │
                         │                                              │
                         │      [ Langfuse Web :3000 ]                  │
                         │         └──► Langfuse Worker                 │
                         │         └──► Postgres :5432                  │
                         │         └──► ClickHouse :8123                │
                         │                                              │
                         │      [ Prometheus :9090 ] [ Grafana :3001 ]  │
                         │                                              │
                         │      Restic daemon ──► DO Spaces (S3)        │
                         └──────────────────────────────────────────────┘

Internal network on the Docker Compose bridge; only Nginx and SSH are exposed to the outside world.

The complete docker-compose.yml

This file lives at /opt/numoru-ai/docker-compose.yml. All credentials are read from .env.

version: "3.9"

networks:
  core:
    driver: bridge

volumes:
  qdrant_data:
  redis_data:
  ollama_data:
  lf_postgres:
  lf_clickhouse:
  lf_minio:
  prometheus_data:
  grafana_data:

services:
  # --- Reverse proxy ---
  nginx:
    image: nginx:1.27-alpine
    restart: unless-stopped
    ports: ["80:80", "443:443"]
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
      - ./certs:/etc/letsencrypt:ro
    networks: [core]
    depends_on: [litellm, langfuse-web, grafana]

  # --- Vector database ---
  qdrant:
    image: qdrant/qdrant:v1.12.5
    restart: unless-stopped
    volumes: [qdrant_data:/qdrant/storage]
    environment:
      QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
      QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: 2
    networks: [core]
    deploy:
      resources:
        limits: { memory: 2g, cpus: "1.5" }

  # --- Semantic cache + working memory ---
  # Note: Langfuse also uses this instance for its queues (db 1). Langfuse
  # docs recommend maxmemory-policy noeviction for queues, so allkeys-lru
  # trades queue durability for RAM; use a dedicated Redis if that matters.
  redis:
    image: redis/redis-stack-server:7.4.0-v1
    restart: unless-stopped
    command: >
      redis-stack-server
      --requirepass ${REDIS_PASSWORD}
      --maxmemory 1gb
      --maxmemory-policy allkeys-lru
    volumes: [redis_data:/data]
    networks: [core]
    deploy:
      resources:
        limits: { memory: 1200m, cpus: "0.75" }

  # --- Unified LLM gateway ---
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    restart: unless-stopped
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      # NOTE: the postgres image only creates the "langfuse" database;
      # create the litellm database once (CREATE DATABASE litellm;) before first boot.
      DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/litellm
      LANGFUSE_PUBLIC_KEY: ${LF_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LF_SECRET_KEY}
      LANGFUSE_HOST: http://langfuse-web:3000
    volumes: [./litellm/config.yaml:/app/config.yaml:ro]
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    networks: [core]
    depends_on: [langfuse-web, redis]
    deploy:
      resources:
        limits: { memory: 512m, cpus: "0.5" }

  # --- Local models ---
  ollama:
    image: ollama/ollama:0.5.4
    restart: unless-stopped
    volumes: [ollama_data:/root/.ollama]
    networks: [core]
    deploy:
      resources:
        limits: { memory: 5g, cpus: "2.5" }

  # --- Langfuse (observability) ---
  langfuse-db:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_USER: lf
      POSTGRES_PASSWORD: ${LF_DB_PASSWORD}
      POSTGRES_DB: langfuse
    volumes: [lf_postgres:/var/lib/postgresql/data]
    networks: [core]

  langfuse-clickhouse:
    image: clickhouse/clickhouse-server:24.8
    restart: unless-stopped
    environment:
      CLICKHOUSE_USER: lf
      CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
      CLICKHOUSE_DB: langfuse
    volumes: [lf_clickhouse:/var/lib/clickhouse]
    networks: [core]

  langfuse-web:
    image: langfuse/langfuse:3
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/langfuse
      CLICKHOUSE_URL: http://langfuse-clickhouse:8123
      CLICKHOUSE_USER: lf
      CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
      REDIS_CONNECTION_STRING: redis://:${REDIS_PASSWORD}@redis:6379/1
      NEXTAUTH_URL: https://langfuse.${DOMAIN}
      NEXTAUTH_SECRET: ${LF_NEXTAUTH_SECRET}
      SALT: ${LF_SALT}
      ENCRYPTION_KEY: ${LF_ENCRYPTION_KEY}
    depends_on: [langfuse-db, langfuse-clickhouse, redis]
    networks: [core]

  langfuse-worker:
    image: langfuse/langfuse-worker:3
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/langfuse
      CLICKHOUSE_URL: http://langfuse-clickhouse:8123
      CLICKHOUSE_USER: lf
      CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
      REDIS_CONNECTION_STRING: redis://:${REDIS_PASSWORD}@redis:6379/1
      SALT: ${LF_SALT}
      ENCRYPTION_KEY: ${LF_ENCRYPTION_KEY}
    depends_on: [langfuse-web]
    networks: [core]

  # --- Metrics ---
  prometheus:
    image: prom/prometheus:v2.55.1
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    networks: [core]

  grafana:
    image: grafana/grafana:11.3.1
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_SERVER_HTTP_PORT: "3001"  # match the :3001 shown in the architecture diagram
    volumes: [grafana_data:/var/lib/grafana]
    networks: [core]

  # --- Backups ---
  restic:
    image: mazzolino/restic:1.7.3
    restart: unless-stopped
    environment:
      RUN_ON_STARTUP: "false"
      BACKUP_CRON: "0 4 * * *"
      RESTIC_REPOSITORY: s3:${SPACES_ENDPOINT}/${SPACES_BUCKET}/restic
      RESTIC_PASSWORD: ${RESTIC_PASSWORD}
      AWS_ACCESS_KEY_ID: ${SPACES_KEY}
      AWS_SECRET_ACCESS_KEY: ${SPACES_SECRET}
      RESTIC_FORGET_ARGS: "--keep-daily 7 --keep-weekly 4 --keep-monthly 6"
    volumes:
      - qdrant_data:/mnt/qdrant:ro
      - lf_postgres:/mnt/lf_postgres:ro
      - lf_clickhouse:/mnt/lf_clickhouse:ro
      - redis_data:/mnt/redis:ro
    networks: [core]
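
After docker compose up -d, a quick readiness probe saves debugging time. A minimal sketch using each project's documented health endpoints (adjust paths if your versions differ); run it from a container attached to the core network, since only Nginx publishes ports to the host:

# Readiness probe for the stack's internal services.
# Endpoint paths follow each project's docs at the time of writing.
import sys
import requests

CHECKS = {
    "qdrant":     "http://qdrant:6333/readyz",
    "litellm":    "http://litellm:4000/health/liveliness",
    "langfuse":   "http://langfuse-web:3000/api/public/health",
    "grafana":    "http://grafana:3001/api/health",   # GF_SERVER_HTTP_PORT above
    "prometheus": "http://prometheus:9090/-/healthy",
}

failed = False
for name, url in CHECKS.items():
    try:
        ok = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    print(f"{name:<11} {'OK' if ok else 'FAIL'}")
    failed = failed or not ok

sys.exit(1 if failed else 0)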

Memory budget

The s-4vcpu-8gb droplet offers 7.5 GB usable after the kernel. Memory budget:

Service                         Limit      Justification
Ollama                          5 GB       Llama 3.3 8B Q4_K_M fits in ~4.8 GB
Qdrant                          2 GB       10M vectors of 768 dims with scalar quantization
Redis                           1.2 GB     Semantic cache + agent state
ClickHouse                      1 GB       Langfuse observability
Postgres (Langfuse)             512 MB     Metadata
Langfuse web + worker           1.25 GB
Nginx + Prometheus + Grafana    400 MB
Total committed                 ~11.3 GB

Important: the limits deliberately oversubscribe physical RAM, because Ollama only consumes its 5 GB while actively serving a request. If your load is mostly agents using Claude/GPT through LiteLLM and only occasionally Ollama, the actual working set stays under 6 GB. If your client needs Ollama as the primary model, upgrade to s-4vcpu-16gb ($96).
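
Fitting that many vectors into Qdrant's 2 GB slice depends on creating the collection with scalar quantization and on-disk originals. A minimal sketch with the official qdrant-client package; the collection name is a placeholder:

# 768-dim collection: int8-quantized vectors stay in RAM,
# full-precision originals live on disk.
import os
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(url="http://qdrant:6333", api_key=os.environ["QDRANT_API_KEY"])
client.create_collection(
    collection_name="docs",  # placeholder
    vectors_config=VectorParams(size=768, distance=Distance.COSINE, on_disk=True),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True),
    ),
)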

LiteLLM Proxy configuration

File /opt/numoru-ai/litellm/config.yaml:

model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.3:8b-instruct-q4_K_M
      api_base: http://ollama:11434

litellm_settings:
  cache: true
  cache_params:
    type: redis-semantic
    host: redis
    port: 6379
    password: os.environ/REDIS_PASSWORD
    similarity_threshold: 0.92
    ttl: 86400
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

router_settings:
  routing_strategy: latency-based-routing
  fallbacks:
    - claude-sonnet: [gpt-4o, llama-local]
    - gpt-4o: [claude-sonnet, llama-local]

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL

This file enables three critical things: a semantic cache (45-60% hit rate in customer-support agents), automatic fallbacks (if Anthropic goes down, traffic shifts to OpenAI or the local model), and Langfuse traces without instrumenting each client.
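
Clients never need provider SDKs: any OpenAI-compatible client pointed at the gateway gets the cache, fallback and tracing for free. A minimal sketch with the openai Python package; the key is the LiteLLM master key or a virtual key you issue:

# Call the gateway through the public Nginx endpoint.
# Cache, fallback and Langfuse tracing all happen server-side.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.yourdomain.com",  # Nginx -> litellm:4000
    api_key=os.environ["LITELLM_KEY"],      # master or virtual key
)

resp = client.chat.completions.create(
    model="claude-sonnet",  # alias from config.yaml; "llama-local" keeps data on-box
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(resp.choices[0].message.content)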

Nginx with automatic TLS

Recommended subdomains:

  • api.yourdomain.com → LiteLLM Proxy (:4000)
  • langfuse.yourdomain.com → Langfuse web (:3000)
  • grafana.yourdomain.com → Grafana (:3001)
  • qdrant.yourdomain.com → Qdrant HTTP (:6333) — protected with basic auth in addition to the API key

File /opt/numoru-ai/nginx/conf.d/api.conf:

server {
  listen 443 ssl;
  http2 on;  # "listen ... http2" is deprecated since nginx 1.25
  server_name api.yourdomain.com;
  ssl_certificate     /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

  client_max_body_size 25m;

  location / {
    proxy_pass http://litellm:4000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 300s;
  }
}

Certificates are renewed by a cron job that runs certbot renew --webroot every three days.

Backup and restore

Restic runs at 4 a.m. every day and backs up all volumes to Digital Ocean Spaces. Retention policy: 7 daily, 4 weekly and 6 monthly snapshots.

Tested restore: on a fresh droplet, git clone of the infrastructure repo + .env + restic restore latest --target / regenerates the entire stack in under 15 minutes. Recovery time objective (RTO): 20 minutes. Recovery point objective (RPO): 24 hours.

Observability: what Langfuse + Grafana give you

  • Langfuse: every LLM call with input, output, cost, latency and user. Programmable evals (see the sketch after this list). Versioned prompt management.
  • Provisioned Grafana dashboards:
    • Tokens/min per model
    • Redis cache hit rate (target >40%)
    • p50/p95/p99 latency per LiteLLM endpoint
    • Qdrant disk usage
    • Langfuse worker queue backlog
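
Since the LiteLLM callback already creates a trace per call, evals reduce to attaching scores. A minimal sketch, assuming the v2-style Langfuse Python SDK (adapt to your SDK version); the trace name and score are illustrative:

# Attach a programmable eval score to a trace.
# Assumes the v2-style langfuse Python SDK.
import os
from langfuse import Langfuse

langfuse = Langfuse(
    host="https://langfuse.yourdomain.com",
    public_key=os.environ["LF_PUBLIC_KEY"],
    secret_key=os.environ["LF_SECRET_KEY"],
)

trace = langfuse.trace(name="support-agent", user_id="user-123")  # illustrative
trace.score(name="helpfulness", value=0.9)  # 0-1 eval score
langfuse.flush()  # ensure delivery before the process exits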

Real costs (April 2026)

Item                              USD/month
s-4vcpu-8gb droplet               40
Spaces (50 GB + moderate egress)  5
Domain + certs                    1
Anthropic/OpenAI (passthrough)    variable
Base infrastructure               46

With this stack, a client previously paying $800/month for SaaS RAG + observability + gateway typically drops to $46 + LLM costs (which themselves fall ~45% via the semantic cache). ROI is immediate from week two.

Monthly infrastructure cost: SaaS equivalents vs self-hosted stack

Entry/standard tier list price of each managed equivalent vs the unified $46 / mo self-hosted drop-in. Use the full chart to justify migration to a CFO.

[Bar chart: managed SaaS list price (USD/mo) for vector DB (Pinecone Standard), LLM observability (LangSmith), gateway + keys (OpenRouter business), cache + rate limit (Upstash Pro) and local inference (Modal / Replicate), each compared with the unified $46/mo self-hosted stack; axis $0 to $280.]

Source: public pricing pages of Pinecone, LangSmith, OpenRouter and Upstash, February 2026.

Business & commercial impact

What Numoru sells around this stack

The free docker-compose.yml is a marketing asset. The revenue comes from two productized services: a fixed-price installation ($3.5k-$8k) that stands the stack up in a customer's DO account, and a managed operations retainer ($450-1,200/mo) that covers upgrades, backups and incident response. Both are margin-heavy because the underlying infra is $46.

Who buys self-hosted AI infra

Stack install + managed ops pricing by buyer (Numoru, 2026)

  • Fintech (data residency): cannot send PII to OpenAI; wants local RAG + LLM traces in-region. $6,500 install + $950/mo ops, 12 mo ops.
  • Healthcare / telemedicine: HIPAA-equivalent compliance; prefers full on-prem or hybrid. $8,000 install + $1,200/mo ops, 24 mo ops.
  • Legal / accounting: confidentiality; AI features without sharing client files with SaaS. $4,500 install + $650/mo ops, 12 mo.
  • Agencies + boutique SaaS: trying to cut the AI infra bill; willing to operate it themselves after install. $3,500 install (ops optional), one-time with 30-day warranty.
  • Enterprise dev teams (EU): AI Act data residency; Qdrant + Langfuse deployed in Frankfurt / Amsterdam. $7,500 install + $1,100/mo ops, 24 mo.
  • LATAM government / NGO: sovereign AI infra, zero foreign SaaS; self-host everything. $12,000 install + $1,600/mo ops, 12 mo + training.

Public benchmarks supporting the pitch

Public case study · Vector DB · Global · 2024

Qdrant: performance vs hosted alternatives

Challenge: publish reproducible benchmarks comparing managed vector DB services on latency and cost.
Solution: the Qdrant benchmark suite runs identical workloads against Pinecone, Weaviate, Milvus and self-hosted Qdrant.
Results:
  • Qdrant p95 latency: <50 ms (1M+ vector set, single node)
  • Pinecone p95 (Standard): ~300 ms (equivalent workload)
  • Self-host cost ratio: 1/12 vs Pinecone Standard
Public case study · LLM observability · Global · 2024-2025

Langfuse: self-hosted adoption

Challenge: document adoption of the self-hosted option for customers with data residency needs.
Solution: Langfuse publishes deployment stats and ships a self-hosted edition under the MIT license.
Results:
  • Self-hosted deployments: 8,000+ reported globally
  • Enterprise self-host customers: major banks + healthtech (case studies on its blog)
  • SaaS → self-hosted migrations: ongoing, driven by the EU AI Act
Public case study · Cloud provider · Global · 2024

Digital Ocean: droplet sizing for AI workloads

Challenge: define sizing guidance for running production AI components on DO infrastructure.
Solution: DO published tutorials and community case studies covering Docker Compose deployments of vector DB + observability + proxy stacks.
Results:
  • Recommended entry size: 4 vCPU / 8 GB (up to 500 concurrent users)
  • Typical uptime: 99.95% (DO droplet SLA)
  • Monthly list price: $40 (s-4vcpu-8gb droplet)

Illustrative case — agency migrating 7 clients off managed SaaS

Illustrative case · AI services agency · 12 employees · serves 14 enterprise clients · LATAM + USA

Numoru partner agency migrating 7 mid-market clients to a shared stack

Baseline: each client paid $620-1,100/mo split between Pinecone + LangSmith + OpenRouter + Upstash. Combined bill: $6,800/mo. Data residency was a growing objection.
Intervention: Numoru installed the self-hosted stack on 7 dedicated droplets (one per client). The agency absorbed the ops retainer for the first 90 days, then charged clients $450/mo each.
Projected outcome (12 mo):
  • Combined managed bill: $6,800 → $322 (7 droplets + Spaces)
  • Net monthly savings: $6,478, passed through after retainer
  • Agency ops retainer: +$3,150/mo (7 × $450)
  • Installation fee (one-time): $45,500 (7 × $6,500 setup)
  • Data residency objections: resolved; all 7 clients renewed
  • Annualized run-rate lift: +$37,800 (retainer net of infra)

Savings math based on public pricing of Pinecone Standard, LangSmith Team and Upstash Pro. Synthetic case, not a specific Numoru client.

ROI calculator — migrate off managed SaaS

Single mid-market client: managed SaaS vs Numoru self-hosted (12 months)

Payback: 3 months

Assumptions:
  • Current SaaS bill (RAG + obs + gateway): $820/mo
  • Monthly LLM API spend: $3,400/mo
  • Semantic-cache hit rate: 45%
  • LLM savings from cache: $1,530/mo
  • Numoru install (one-time): $6,500
  • Numoru ops retainer: $450/mo
  • DO droplet + Spaces + TLS: $46/mo
  • Engineering time saved (backups, upgrades): 6 h/mo × $95

12-month ledger:
  • Install (one-time): −$6,500
  • Managed SaaS avoided (12 × $820): +$9,840
  • LLM semantic-cache savings (12 × $1,530): +$18,360
  • Droplet + Spaces + TLS (12 × $46): −$552
  • Numoru ops retainer (12 × $450): −$5,400
  • Engineering time saved (6 h × $95 × 12): +$6,840
  • Net year-1 contribution: +$22,588
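
The ledger reduces to a few lines of arithmetic; this sketch reproduces the same numbers so you can substitute your own inputs:

# Year-1 contribution for the single-client migration above.
MONTHS = 12
saas_avoided   = 820 * MONTHS           # +9,840
cache_savings  = 3_400 * 0.45 * MONTHS  # +18,360
eng_time_saved = 6 * 95 * MONTHS        # +6,840
install        = -6_500
retainer       = -450 * MONTHS          # -5,400
infra          = -46 * MONTHS           # -552

net = saas_avoided + cache_savings + eng_time_saved + install + retainer + infra
print(f"net year-1 contribution: ${net:,.0f}")  # $22,588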

Pricing tiers Numoru sells

Install
$3,500 one-time
Stand-up only. DIY after that.
  • Runs in your DO account
  • Compose with Qdrant + Langfuse + LiteLLM + Redis
  • TLS + Nginx + backups
  • 30-day warranty
  • Runbook PDF
  • Stripe / card or PO
Install + Ops
$6,500 one-time + $650/mo
Install + 12-month managed ops.
  • Everything in Install
  • 24 / 7 monitoring (Grafana + alerts)
  • Patching + version upgrades
  • Monthly incident review
  • Slack shared channel
  • Quarterly cost audit
Compliance pack
$9,500+ one-time + $1,100/mo
EU data residency or HIPAA scope.
  • EU-region deployment (FRA / AMS / PAR)
  • HIPAA-equivalent controls
  • AI Act technical-docs bundle
  • SAML / SSO
  • Pen-test coordination
  • Annual compliance attestation

Ops retainer scales with droplet count — multi-client agencies get 20% off from 5 droplets, 30% off from 10.

FAQ

Does this stack handle real production traffic?
Yes, for sustained loads up to ~30 queries per second and 500k LLM calls per day. Beyond that, split Qdrant and Langfuse onto their own droplets.

Can I run Claude or GPT locally?
No; they are closed models. Ollama + Llama 3.3 8B / Qwen 2.5 7B is the local route. For sensitive cases, mix routes: generic prompts go to Claude via LiteLLM, prompts with PII go to local Llama (see the routing sketch after this FAQ).

Why not Kubernetes?
A single droplet with Docker Compose is cheaper, easier for an independent consultant to operate, and sufficient up to 500 concurrent users. Kubernetes makes sense from 3 production nodes onward.

Is it AI Act compatible?
The stack meets the technical requirements for transparency, log keeping and data residency. Formal compliance also requires DPIA documentation and governance; that's a service, not infrastructure.

Can I replace Qdrant with pgvector?
For under 100k vectors and low QPS, yes. From 1M vectors and hybrid search in production, Qdrant clearly wins on latency (p95 <50 ms vs ~300 ms).
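
A naive version of that PII split, with a regex heuristic standing in for a real detector (use Presidio or similar in production):

# Route prompts that look like they contain PII to the local model.
# The regex is a toy stand-in for a proper PII detector.
import re

PII_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.\w+|\b\d{13,16}\b")  # emails, card-like numbers

def pick_model(prompt: str) -> str:
    return "llama-local" if PII_PATTERN.search(prompt) else "claude-sonnet"

# pick_model("card 4111111111111111 was declined") -> "llama-local"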

Next steps

The complete repository with docker-compose.yml, Terraform to create the droplet and incident runbooks is published at github.com/numoru-ia/ai-stack-do. The next article in this series covers the tiered memory pattern that uses Redis + Langfuse + Mem0 inside this stack.

Want results like these for your company?

Start a conversation