Engineering · self-hosted · qdrant · langfuse

Self-hosted AI stack on a $40 Digital Ocean droplet: Qdrant + Langfuse + LiteLLM + Redis

Production Docker Compose with Qdrant, Langfuse, LiteLLM Proxy, Redis 8 and Ollama on a $40 droplet. Nginx + Certbot, backups to Spaces, monitoring with Grafana and Prometheus.

Numoru Engineering · Published April 26, 2026 · 18 min read
Implementation proposal: github.com/numoru-ia/ai-stack-do

TL;DR

A $40/month Digital Ocean droplet (4 vCPU, 8 GB RAM, 160 GB SSD) is enough to run a complete, production-ready AI stack: Qdrant as the vector DB, Langfuse for LLM observability, LiteLLM Proxy as a unified gateway, Redis 8 for semantic cache and agent memory, Ollama for local models, Nginx + Certbot for TLS, Prometheus + Grafana for metrics, and Restic for backups to Spaces. This article publishes the full docker-compose.yml, the per-container resource budget, the internal network setup and the recovery plan if the droplet goes down.

If your client cannot or will not send data to OpenAI and your monthly AI infrastructure budget is under $100 USD, this is the starting point.

  • $46/month base infrastructure (droplet + Spaces + TLS)
  • ~94% cost reduction vs SaaS equivalents (comparable feature set)
  • −45% LLM cost via the Redis semantic cache (typical hit rate)
  • 30 qps sustained load on a single 4 vCPU droplet

Why self-hosting still makes sense in 2026

The dominant narrative is that "everything is in managed cloud." In practice, three forces push toward self-hosting:

  1. Regulatory data residency. The EU AI Act (enforceable from August 2, 2026) and sectoral frameworks for healthcare and finance require certain data to remain within a specific jurisdiction. Sending it to a hyperscaler API typically means signing heavy DPAs and annual audits.
  2. Marginal inference cost. An agent flow with 50 LLM calls per session and a 45% semantic cache hit rate pays for only ~55% of those calls, roughly halving cost per session when Redis lives on the same server as the orchestrator (see the cost sketch after this list). Latency to the cache also drops from 30-80 ms intra-region to <2 ms on loopback.
  3. Vendor lock-in. Langfuse, Qdrant, LiteLLM, Redis, Ollama and Mastra are all Apache 2.0 or MIT pieces you can move between providers without rewriting code.
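
A back-of-envelope version of that cache math, in Python; the per-call price is an illustrative assumption, not a benchmark:

# Cost per agent session with and without a semantic cache.
# COST_PER_CALL_USD is a hypothetical blended price, not a measured value.
CALLS_PER_SESSION = 50
HIT_RATE = 0.45            # typical hit rate cited above
COST_PER_CALL_USD = 0.004  # illustrative assumption

uncached = CALLS_PER_SESSION * COST_PER_CALL_USD
cached = CALLS_PER_SESSION * (1 - HIT_RATE) * COST_PER_CALL_USD
print(f"uncached: ${uncached:.3f}/session, cached: ${cached:.3f}/session")
# uncached: $0.200/session, cached: $0.110/session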

What this article does not solve: training or serving large LLMs (>13B parameters) with low latency — for that, managed APIs or dedicated GPUs remain better.

Architecture

                         ┌──────────────────────────────────────────────┐
                         │   Droplet s-4vcpu-8gb ($40/month)            │
                         │                                              │
  Client (HTTPS) ───► Nginx ──► [ LiteLLM Proxy   :4000 ]               │
                         │         │                                    │
                         │         ├──► Anthropic / OpenAI / Gemini     │
                         │         │   (rate limit + fallback)          │
                         │         │                                    │
                         │         └──► Ollama :11434 (Llama 3.3 8B)    │
                         │                                              │
                         │      [ Qdrant :6333 ]  [ Redis :6379 ]       │
                         │                                              │
                         │      [ Langfuse Web :3000 ]                  │
                         │         └──► Langfuse Worker                 │
                         │         └──► Postgres :5432                  │
                         │         └──► ClickHouse :8123                │
                         │                                              │
                         │      [ Prometheus :9090 ] [ Grafana :3001 ]  │
                         │                                              │
                         │      Restic daemon ──► DO Spaces (S3)        │
                         └──────────────────────────────────────────────┘

Internal network on the Docker Compose bridge; only Nginx and SSH are exposed to the outside world.

The complete docker-compose.yml

This file lives at /opt/numoru-ai/docker-compose.yml. All credentials are read from .env.

version: "3.9"

networks:
  core:
    driver: bridge

volumes:
  qdrant_data:
  redis_data:
  ollama_data:
  lf_postgres:
  lf_clickhouse:
  lf_minio:
  prometheus_data:
  grafana_data:

services:
  # --- Reverse proxy ---
  nginx:
    image: nginx:1.27-alpine
    restart: unless-stopped
    ports: ["80:80", "443:443"]
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
      - ./certs:/etc/letsencrypt:ro
    networks: [core]
    depends_on: [litellm, langfuse-web, grafana]

  # --- Vector database ---
  qdrant:
    image: qdrant/qdrant:v1.12.5
    restart: unless-stopped
    volumes: [qdrant_data:/qdrant/storage]
    environment:
      QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
      QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: 2
    networks: [core]
    deploy:
      resources:
        limits: { memory: 2g, cpus: "1.5" }

  # --- Semantic cache + working memory ---
  # Note: Langfuse also uses this instance for its queues (db 1). Langfuse
  # docs recommend maxmemory-policy noeviction for queues, so allkeys-lru
  # trades queue durability for RAM; use a dedicated Redis if that matters.
  redis:
    image: redis/redis-stack-server:7.4.0-v1
    restart: unless-stopped
    command: >
      redis-stack-server
      --requirepass ${REDIS_PASSWORD}
      --maxmemory 1gb
      --maxmemory-policy allkeys-lru
    volumes: [redis_data:/data]
    networks: [core]
    deploy:
      resources:
        limits: { memory: 1200m, cpus: "0.75" }

  # --- Unified LLM gateway ---
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    restart: unless-stopped
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      # NOTE: the postgres image only creates the "langfuse" database;
      # create the litellm database once (CREATE DATABASE litellm;) before first boot.
      DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/litellm
      LANGFUSE_PUBLIC_KEY: ${LF_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LF_SECRET_KEY}
      LANGFUSE_HOST: http://langfuse-web:3000
    volumes: [./litellm/config.yaml:/app/config.yaml:ro]
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    networks: [core]
    depends_on: [langfuse-web, redis]
    deploy:
      resources:
        limits: { memory: 512m, cpus: "0.5" }

  # --- Local models ---
  ollama:
    image: ollama/ollama:0.5.4
    restart: unless-stopped
    volumes: [ollama_data:/root/.ollama]
    networks: [core]
    deploy:
      resources:
        limits: { memory: 5g, cpus: "2.5" }

  # --- Langfuse (observability) ---
  langfuse-db:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_USER: lf
      POSTGRES_PASSWORD: ${LF_DB_PASSWORD}
      POSTGRES_DB: langfuse
    volumes: [lf_postgres:/var/lib/postgresql/data]
    networks: [core]

  langfuse-clickhouse:
    image: clickhouse/clickhouse-server:24.8
    restart: unless-stopped
    environment:
      CLICKHOUSE_USER: lf
      CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
      CLICKHOUSE_DB: langfuse
    volumes: [lf_clickhouse:/var/lib/clickhouse]
    networks: [core]

  langfuse-web:
    image: langfuse/langfuse:3
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/langfuse
      CLICKHOUSE_URL: http://langfuse-clickhouse:8123
      CLICKHOUSE_USER: lf
      CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
      REDIS_CONNECTION_STRING: redis://:${REDIS_PASSWORD}@redis:6379/1
      NEXTAUTH_URL: https://langfuse.${DOMAIN}
      NEXTAUTH_SECRET: ${LF_NEXTAUTH_SECRET}
      SALT: ${LF_SALT}
      ENCRYPTION_KEY: ${LF_ENCRYPTION_KEY}
    depends_on: [langfuse-db, langfuse-clickhouse, redis]
    networks: [core]

  langfuse-worker:
    image: langfuse/langfuse-worker:3
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://lf:${LF_DB_PASSWORD}@langfuse-db:5432/langfuse
      CLICKHOUSE_URL: http://langfuse-clickhouse:8123
      CLICKHOUSE_USER: lf
      CLICKHOUSE_PASSWORD: ${LF_CLICKHOUSE_PASSWORD}
      REDIS_CONNECTION_STRING: redis://:${REDIS_PASSWORD}@redis:6379/1
      SALT: ${LF_SALT}
      ENCRYPTION_KEY: ${LF_ENCRYPTION_KEY}
    depends_on: [langfuse-web]
    networks: [core]

  # --- Metrics ---
  prometheus:
    image: prom/prometheus:v2.55.1
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    networks: [core]

  grafana:
    image: grafana/grafana:11.3.1
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_SERVER_HTTP_PORT: "3001"  # match the :3001 shown in the architecture diagram
    volumes: [grafana_data:/var/lib/grafana]
    networks: [core]

  # --- Backups ---
  restic:
    image: mazzolino/restic:1.7.3
    restart: unless-stopped
    environment:
      RUN_ON_STARTUP: "false"
      BACKUP_CRON: "0 4 * * *"
      RESTIC_REPOSITORY: s3:${SPACES_ENDPOINT}/${SPACES_BUCKET}/restic
      RESTIC_PASSWORD: ${RESTIC_PASSWORD}
      AWS_ACCESS_KEY_ID: ${SPACES_KEY}
      AWS_SECRET_ACCESS_KEY: ${SPACES_SECRET}
      RESTIC_FORGET_ARGS: "--keep-daily 7 --keep-weekly 4 --keep-monthly 6"
    volumes:
      - qdrant_data:/mnt/qdrant:ro
      - lf_postgres:/mnt/lf_postgres:ro
      - lf_clickhouse:/mnt/lf_clickhouse:ro
      - redis_data:/mnt/redis:ro
    networks: [core]
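
After docker compose up -d, a quick readiness probe saves debugging time. A minimal sketch using each project's documented health endpoints (adjust paths if your versions differ); run it from a container attached to the core network, since only Nginx publishes ports to the host:

# Readiness probe for the stack's internal services.
# Endpoint paths follow each project's docs at the time of writing.
import sys
import requests

CHECKS = {
    "qdrant":     "http://qdrant:6333/readyz",
    "litellm":    "http://litellm:4000/health/liveliness",
    "langfuse":   "http://langfuse-web:3000/api/public/health",
    "grafana":    "http://grafana:3001/api/health",   # GF_SERVER_HTTP_PORT above
    "prometheus": "http://prometheus:9090/-/healthy",
}

failed = False
for name, url in CHECKS.items():
    try:
        ok = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    print(f"{name:<11} {'OK' if ok else 'FAIL'}")
    failed = failed or not ok

sys.exit(1 if failed else 0)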

Memory budget

The s-4vcpu-8gb droplet offers 7.5 GB usable after the kernel. Memory budget:

Service                         Limit      Justification
Ollama                          5 GB       Llama 3.3 8B Q4_K_M fits in ~4.8 GB
Qdrant                          2 GB       10M vectors of 768 dims with scalar quantization
Redis                           1.2 GB     Semantic cache + agent state
ClickHouse                      1 GB       Langfuse observability
Postgres (Langfuse)             512 MB     Metadata
Langfuse web + worker           1.25 GB
Nginx + Prometheus + Grafana    400 MB
Total committed                 ~11.3 GB

Important: the limits deliberately oversubscribe physical RAM, because Ollama only consumes its 5 GB while actively serving a request. If your load is mostly agents using Claude/GPT through LiteLLM and only occasionally Ollama, the actual working set stays under 6 GB. If your client needs Ollama as the primary model, upgrade to s-4vcpu-16gb ($96).
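
Fitting that many vectors into Qdrant's 2 GB slice depends on creating the collection with scalar quantization and on-disk originals. A minimal sketch with the official qdrant-client package; the collection name is a placeholder:

# 768-dim collection: int8-quantized vectors stay in RAM,
# full-precision originals live on disk.
import os
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(url="http://qdrant:6333", api_key=os.environ["QDRANT_API_KEY"])
client.create_collection(
    collection_name="docs",  # placeholder
    vectors_config=VectorParams(size=768, distance=Distance.COSINE, on_disk=True),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True),
    ),
)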

LiteLLM Proxy configuration

File /opt/numoru-ai/litellm/config.yaml:

model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.3:8b-instruct-q4_K_M
      api_base: http://ollama:11434

litellm_settings:
  cache: true
  cache_params:
    type: redis-semantic
    host: redis
    port: 6379
    password: os.environ/REDIS_PASSWORD
    similarity_threshold: 0.92
    ttl: 86400
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

router_settings:
  routing_strategy: latency-based-routing
  fallbacks:
    - claude-sonnet: [gpt-4o, llama-local]
    - gpt-4o: [claude-sonnet, llama-local]

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL

This file enables three critical things: a semantic cache (45-60% hit rate in customer-support agents), automatic fallbacks (if Anthropic goes down, traffic shifts to OpenAI or the local model), and Langfuse traces without instrumenting each client.
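
Clients never need provider SDKs: any OpenAI-compatible client pointed at the gateway gets the cache, fallback and tracing for free. A minimal sketch with the openai Python package; the key is the LiteLLM master key or a virtual key you issue:

# Call the gateway through the public Nginx endpoint.
# Cache, fallback and Langfuse tracing all happen server-side.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.yourdomain.com",  # Nginx -> litellm:4000
    api_key=os.environ["LITELLM_KEY"],      # master or virtual key
)

resp = client.chat.completions.create(
    model="claude-sonnet",  # alias from config.yaml; "llama-local" keeps data on-box
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(resp.choices[0].message.content)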

Nginx with automatic TLS

Recommended subdomains:

  • api.yourdomain.com → LiteLLM Proxy (:4000)
  • langfuse.yourdomain.com → Langfuse web (:3000)
  • grafana.yourdomain.com → Grafana (:3001)
  • qdrant.yourdomain.com → Qdrant HTTP (:6333) — protected with basic auth in addition to the API key

File /opt/numoru-ai/nginx/conf.d/api.conf:

server {
  listen 443 ssl;
  http2 on;  # "listen ... http2" is deprecated since nginx 1.25
  server_name api.yourdomain.com;
  ssl_certificate     /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

  client_max_body_size 25m;

  location / {
    proxy_pass http://litellm:4000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 300s;
  }
}

Certificates are renewed by a cron job that runs certbot renew --webroot every three days.

Backup and restore

Restic runs at 4 a.m. every day and backs up all volumes to Digital Ocean Spaces. Retention policy: 7 daily, 4 weekly and 6 monthly snapshots.

Tested restore: on a fresh droplet, git clone of the infrastructure repo + .env + restic restore latest --target / regenerates the entire stack in under 15 minutes. Recovery time objective (RTO): 20 minutes. Recovery point objective (RPO): 24 hours.

Observability: what Langfuse + Grafana give you

  • Langfuse: every LLM call with input, output, cost, latency and user. Programmable evals (see the sketch after this list). Versioned prompt management.
  • Provisioned Grafana dashboards:
    • Tokens/min per model
    • Redis cache hit rate (target >40%)
    • p50/p95/p99 latency per LiteLLM endpoint
    • Qdrant disk usage
    • Langfuse worker queue backlog
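
Since the LiteLLM callback already creates a trace per call, evals reduce to attaching scores. A minimal sketch, assuming the v2-style Langfuse Python SDK (adapt to your SDK version); the trace name and score are illustrative:

# Attach a programmable eval score to a trace.
# Assumes the v2-style langfuse Python SDK.
import os
from langfuse import Langfuse

langfuse = Langfuse(
    host="https://langfuse.yourdomain.com",
    public_key=os.environ["LF_PUBLIC_KEY"],
    secret_key=os.environ["LF_SECRET_KEY"],
)

trace = langfuse.trace(name="support-agent", user_id="user-123")  # illustrative
trace.score(name="helpfulness", value=0.9)  # 0-1 eval score
langfuse.flush()  # ensure delivery before the process exits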

Real costs (April 2026)

Item                              USD/month
s-4vcpu-8gb droplet               40
Spaces (50 GB + moderate egress)  5
Domain + certs                    1
Anthropic/OpenAI (passthrough)    variable
Base infrastructure               46

With this stack, a client previously paying $800/month for SaaS RAG + observability + gateway typically drops to $46 + LLM costs (which themselves fall ~45% via the semantic cache). ROI is immediate from week two.

Monthly infrastructure cost: SaaS equivalents vs self-hosted stack

Entry/standard tier list price of each managed equivalent vs the unified $46 / mo self-hosted drop-in. Use the full chart to justify migration to a CFO.

[Bar chart: managed SaaS list price (USD/mo) for vector DB (Pinecone Standard), LLM observability (LangSmith), gateway + keys (OpenRouter business), cache + rate limit (Upstash Pro) and local inference (Modal / Replicate), each compared with the unified $46/mo self-hosted stack; axis $0 to $280.]

Source: public pricing pages of Pinecone, LangSmith, OpenRouter and Upstash, February 2026.

Business & commercial impact

What Numoru sells around this stack

The free docker-compose.yml is a marketing asset. The revenue comes from two productized services: a fixed-price installation ($3.5k-$8k) that stands the stack up in a customer's DO account, and a managed operations retainer ($450-1,200/mo) that covers upgrades, backups and incident response. Both are margin-heavy because the underlying infra is $46.

Who buys self-hosted AI infra

Stack install + managed ops pricing by buyer (Numoru, 2026)

  • Fintech (data residency): cannot send PII to OpenAI; wants local RAG + LLM traces in-region. $6,500 install + $950/mo ops, 12 mo ops.
  • Healthcare / telemedicine: HIPAA-equivalent compliance; prefers full on-prem or hybrid. $8,000 install + $1,200/mo ops, 24 mo ops.
  • Legal / accounting: confidentiality; AI features without sharing client files with SaaS. $4,500 install + $650/mo ops, 12 mo.
  • Agencies + boutique SaaS: trying to cut the AI infra bill; willing to operate it themselves after install. $3,500 install (ops optional), one-time with 30-day warranty.
  • Enterprise dev teams (EU): AI Act data residency; Qdrant + Langfuse deployed in Frankfurt / Amsterdam. $7,500 install + $1,100/mo ops, 24 mo.
  • LATAM government / NGO: sovereign AI infra, zero foreign SaaS; self-host everything. $12,000 install + $1,600/mo ops, 12 mo + training.

Public benchmarks supporting the pitch

Public case study · Vector DB · Global · 2024

Qdrant: performance vs hosted alternatives

Challenge: publish reproducible benchmarks comparing managed vector DB services on latency and cost.
Solution: the Qdrant benchmark suite runs identical workloads against Pinecone, Weaviate, Milvus and self-hosted Qdrant.
Results:
  • Qdrant p95 latency: <50 ms (1M+ vector set, single node)
  • Pinecone p95 (Standard): ~300 ms (equivalent workload)
  • Self-host cost ratio: 1/12 vs Pinecone Standard
Public case study · LLM observability · Global · 2024-2025

Langfuse: self-hosted adoption

Challenge: document adoption of the self-hosted option for customers with data residency needs.
Solution: Langfuse publishes deployment stats and ships a self-hosted edition under the MIT license.
Results:
  • Self-hosted deployments: 8,000+ reported globally
  • Enterprise self-host customers: major banks + healthtech (case studies on its blog)
  • SaaS → self-hosted migrations: ongoing, driven by the EU AI Act
Public case study · Cloud provider · Global · 2024

Digital Ocean: droplet sizing for AI workloads

Challenge: define sizing guidance for running production AI components on DO infrastructure.
Solution: DO published tutorials and community case studies covering Docker Compose deployments of vector DB + observability + proxy stacks.
Results:
  • Recommended entry size: 4 vCPU / 8 GB (up to 500 concurrent users)
  • Typical uptime: 99.95% (DO droplet SLA)
  • Monthly list price: $40 (s-4vcpu-8gb droplet)

Illustrative case — agency migrating 7 clients off managed SaaS

Illustrative case · AI services agency · 12 employees · serves 14 enterprise clients · LATAM + USA

Numoru partner agency migrating 7 mid-market clients to a shared stack

Baseline: each client paid $620-1,100/mo split between Pinecone + LangSmith + OpenRouter + Upstash. Combined bill: $6,800/mo. Data residency was a growing objection.
Intervention: Numoru installed the self-hosted stack on 7 dedicated droplets (one per client). The agency absorbed the ops retainer for the first 90 days, then charged clients $450/mo each.
Projected outcome (12 mo):
  • Combined managed bill: $6,800 → $322 (7 droplets + Spaces)
  • Net monthly savings: $6,478, passed through after retainer
  • Agency ops retainer: +$3,150/mo (7 × $450)
  • Installation fee (one-time): $45,500 (7 × $6,500 setup)
  • Data residency objections: resolved; all 7 clients renewed
  • Annualized run-rate lift: +$37,800 (retainer net of infra)

Savings math based on public pricing of Pinecone Standard, LangSmith Team and Upstash Pro. Synthetic case, not a specific Numoru client.

ROI calculator — migrate off managed SaaS

Single mid-market client: managed SaaS vs Numoru self-hosted (12 months)

Payback: 3 months

Assumptions:
  • Current SaaS bill (RAG + obs + gateway): $820/mo
  • Monthly LLM API spend: $3,400/mo
  • Semantic-cache hit rate: 45%
  • LLM savings from cache: $1,530/mo
  • Numoru install (one-time): $6,500
  • Numoru ops retainer: $450/mo
  • DO droplet + Spaces + TLS: $46/mo
  • Engineering time saved (backups, upgrades): 6 h/mo × $95

12-month ledger:
  • Install (one-time): −$6,500
  • Managed SaaS avoided (12 × $820): +$9,840
  • LLM semantic-cache savings (12 × $1,530): +$18,360
  • Droplet + Spaces + TLS (12 × $46): −$552
  • Numoru ops retainer (12 × $450): −$5,400
  • Engineering time saved (6 h × $95 × 12): +$6,840
  • Net year-1 contribution: +$22,588
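
The ledger reduces to a few lines of arithmetic; this sketch reproduces the same numbers so you can substitute your own inputs:

# Year-1 contribution for the single-client migration above.
MONTHS = 12
saas_avoided   = 820 * MONTHS           # +9,840
cache_savings  = 3_400 * 0.45 * MONTHS  # +18,360
eng_time_saved = 6 * 95 * MONTHS        # +6,840
install        = -6_500
retainer       = -450 * MONTHS          # -5,400
infra          = -46 * MONTHS           # -552

net = saas_avoided + cache_savings + eng_time_saved + install + retainer + infra
print(f"net year-1 contribution: ${net:,.0f}")  # $22,588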

Pricing tiers Numoru sells

Install
$3,500 one-time
Stand-up only. DIY after that.
  • Runs in your DO account
  • Compose with Qdrant + Langfuse + LiteLLM + Redis
  • TLS + Nginx + backups
  • 30-day warranty
  • Runbook PDF
  • Stripe / card or PO
Install + Ops
$6,500 one-time + $650/mo
Install + 12-month managed ops.
  • Everything in Install
  • 24 / 7 monitoring (Grafana + alerts)
  • Patching + version upgrades
  • Monthly incident review
  • Slack shared channel
  • Quarterly cost audit
Compliance pack
$9,500+ one-time + $1,100/mo
EU data residency or HIPAA scope.
  • EU-region deployment (FRA / AMS / PAR)
  • HIPAA-equivalent controls
  • AI Act technical-docs bundle
  • SAML / SSO
  • Pen-test coordination
  • Annual compliance attestation

Ops retainer scales with droplet count — multi-client agencies get 20% off from 5 droplets, 30% off from 10.

FAQ

Does this stack handle real production traffic?
Yes, for sustained loads up to ~30 queries per second and 500k LLM calls per day. Beyond that, split Qdrant and Langfuse onto their own droplets.

Can I run Claude or GPT locally?
No; they are closed models. Ollama + Llama 3.3 8B / Qwen 2.5 7B is the local route. For sensitive cases, mix routes: generic prompts go to Claude via LiteLLM, prompts with PII go to local Llama (see the routing sketch after this FAQ).

Why not Kubernetes?
A single droplet with Docker Compose is cheaper, easier for an independent consultant to operate, and sufficient up to 500 concurrent users. Kubernetes makes sense from 3 production nodes onward.

Is it AI Act compatible?
The stack meets the technical requirements for transparency, log keeping and data residency. Formal compliance also requires DPIA documentation and governance; that's a service, not infrastructure.

Can I replace Qdrant with pgvector?
For under 100k vectors and low QPS, yes. From 1M vectors and hybrid search in production, Qdrant clearly wins on latency (p95 <50 ms vs ~300 ms).
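
A naive version of that PII split, with a regex heuristic standing in for a real detector (use Presidio or similar in production):

# Route prompts that look like they contain PII to the local model.
# The regex is a toy stand-in for a proper PII detector.
import re

PII_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.\w+|\b\d{13,16}\b")  # emails, card-like numbers

def pick_model(prompt: str) -> str:
    return "llama-local" if PII_PATTERN.search(prompt) else "claude-sonnet"

# pick_model("card 4111111111111111 was declined") -> "llama-local"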

Next steps

The complete repository with docker-compose.yml, Terraform to create the droplet and incident runbooks is published at github.com/numoru-ia/ai-stack-do. The next article in this series covers the tiered memory pattern that uses Redis + Langfuse + Mem0 inside this stack.

Want results like these for your company?

Start a conversation