The problem: rising fraud in digital payments
A Latin American fintech was processing over 2 million transactions per month. Fraud accounted for 4.2% of total volume — more than double the industry benchmark (1.5–2%). The existing manual rules blocked legitimate transactions (18% false positives) while still letting sophisticated fraud patterns through.
The cost was not only financial: every fraudulent transaction generated chargebacks, eroded user trust and added regulatory risk.
The solution: real-time detection with ML
We designed and deployed a 3-layer fraud detection system:
Layer 1: Feature engineering
80% of an ML model's value lives in its features. We built 147 features in 4 categories:
- Transactional: amount, frequency, velocity between transactions, deviation from typical behavior
- Device: device fingerprint, geolocation, IP changes, User-Agent
- Behavioral: hour of day, day of week, navigation patterns before the transaction
- Network: relationships across accounts (same device, same IP, shared beneficiaries)
The network features were the most discriminative. A common fraud pattern involves multiple "clean" accounts that converge on a single beneficiary.
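A sketch of how one such network feature, fan-in per beneficiary, can be computed. The field names and the feature itself are illustrative, not the production pipeline:

```python
from collections import defaultdict

def beneficiary_fan_in(transactions):
    """Count how many distinct sender accounts converge on each
    beneficiary: the 'many clean accounts, one beneficiary' pattern."""
    senders_by_beneficiary = defaultdict(set)
    for tx in transactions:
        senders_by_beneficiary[tx["beneficiary"]].add(tx["sender"])
    return {b: len(s) for b, s in senders_by_beneficiary.items()}

txs = [
    {"sender": "acct_1", "beneficiary": "mule_9"},
    {"sender": "acct_2", "beneficiary": "mule_9"},
    {"sender": "acct_3", "beneficiary": "mule_9"},
    {"sender": "acct_4", "beneficiary": "shop_1"},
]
fan_in = beneficiary_fan_in(txs)
# a fan-in of 3 distinct senders on one beneficiary is a strong signal
```

In production this kind of aggregate is maintained incrementally rather than recomputed per transaction, but the signal is the same.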
Layer 2: Classification model
We benchmarked 4 algorithms and picked XGBoost for its balance of performance, interpretability and inference speed:
| Model | AUC-ROC | Precision | Recall | Latency (p99) |
|---|---|---|---|---|
| Logistic Regression | 0.89 | 0.82 | 0.71 | 2 ms |
| Random Forest | 0.93 | 0.88 | 0.79 | 15 ms |
| XGBoost | 0.96 | 0.92 | 0.85 | 8 ms |
| Neural Network | 0.95 | 0.90 | 0.86 | 45 ms |
The neural net matched XGBoost's recall but at over 5× the latency — unacceptable for real-time decisions.
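For reference, the precision and recall columns in the table fall out of a score threshold; a minimal sketch with toy scores and labels (label 1 = fraud):

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for a binary classifier at a given
    decision threshold on the model score."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0]
p, r = precision_recall(scores, labels, threshold=0.5)
# tp=2 (0.95, 0.80), fp=1 (0.65), fn=1 (0.40): precision 2/3, recall 2/3
```

Moving the threshold trades one metric against the other, which is exactly what the dynamic thresholds in Layer 3 exploit.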
Layer 3: Decision system
Not everything is decided by ML. The system combines:
- Deterministic rules: instant block for known patterns (confirmed stolen cards, IPs on a blocklist)
- Model score: fraud probability (0–1)
- Dynamic threshold: tuned per merchant risk segment and transaction amount
- Manual review: gray-zone transactions (score 0.4–0.7) routed to an analyst team
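Put together, the decision layer reduces to a small function. The blocklist, segments, and per-segment cutoffs below are illustrative; only the 0.4–0.7 review band comes from the text:

```python
BLOCKLISTED_IPS = {"203.0.113.7"}          # illustrative blocklist
REVIEW_BAND = (0.4, 0.7)                   # gray zone routed to analysts
REJECT_THRESHOLD = {"low_risk": 0.85, "high_risk": 0.70}  # per segment

def decide(tx, score):
    # Deterministic rules first: instant block, no model needed.
    if tx["ip"] in BLOCKLISTED_IPS or tx.get("card_reported_stolen"):
        return "reject"
    # Dynamic threshold by merchant risk segment.
    if score >= REJECT_THRESHOLD[tx["merchant_segment"]]:
        return "reject"
    if REVIEW_BAND[0] <= score <= REVIEW_BAND[1]:
        return "review"
    return "approve"

tx = {"ip": "198.51.100.1", "merchant_segment": "high_risk"}
decisions = [decide(tx, s) for s in (0.1, 0.5, 0.9)]
# → ["approve", "review", "reject"]
```

Ordering matters: hard rules fire before the model is consulted, so a confirmed stolen card is rejected even if the score looks benign.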
Production architecture
End-to-end latency under 100 ms:
```
Transaction → API Gateway → Feature Store (Redis)
                   ↓
         Feature Pipeline (Flink)
                   ↓
   XGBoost model (served with BentoML)
                   ↓
Decision Engine → Approve / Reject / Review
```
Key components:
- Feature Store on Redis: pre-computed features with 24h TTL for time-window features
- Apache Flink: real-time feature streaming (1h, 24h, 7d windows)
- BentoML: model serving with automatic batching and health checks
- PostgreSQL: decision log for audit and retraining
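The windowed velocity features that Flink maintains can be sketched in plain Python. This is a toy stand-in for the streaming job, not production code:

```python
from collections import deque

class VelocityWindow:
    """Rolling count of an account's transactions inside a time window,
    mirroring the 1h / 24h / 7d features kept in the feature store."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.timestamps = deque()

    def add(self, ts):
        self.timestamps.append(ts)
        self._evict(ts)

    def count(self, now):
        self._evict(now)
        return len(self.timestamps)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()

one_hour = VelocityWindow(3600)
for ts in (0, 100, 200, 4000):   # seconds; the first three age out by t=4000
    one_hour.add(ts)
# at t=4000 only the transaction at t=4000 remains in the 1h window
```

The 24h TTL on the Redis keys plays the same eviction role for features that are precomputed rather than streamed.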
Production results
After 4 months in production:
| Metric | Before | After | Change |
|---|---|---|---|
| Fraud rate | 4.2% | 1.1% | -73% |
| False positives | 18% | 3.2% | -82% |
| Decision time | 2–5 min (manual) | 85 ms | −99.9% |
| Monthly chargebacks | ~8,400 | ~2,200 | -74% |
ROI was reached in month two. The reduction in chargebacks plus the lift in conversion (fewer legitimate transactions blocked) generated a net positive impact of $2.1M USD annualized.
Lessons learned
What worked
- Network features were the biggest differentiator. Account-relationship patterns are harder to fake than individual features.
- Per-segment dynamic thresholds avoided a one-size-fits-all model that would have been too aggressive for low-risk merchants.
- Drift monitoring with automatic alerts whenever feature distributions shift meaningfully.
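A common way to quantify such distribution shifts is the Population Stability Index; a minimal sketch, with the bin count and the 0.2 alert cutoff as conventional, not project-specific, choices:

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between a training-time sample
    ('expected') and a live sample ('actual') of one feature.
    Rule of thumb: PSI > 0.2 signals meaningful drift."""
    def proportions(values):
        counts = [0] * bins
        width = (hi - lo) / bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # small epsilon so empty bins don't blow up the log
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]                            # uniform
live_same = [i / 100 for i in range(100)]                        # unchanged
live_shifted = [min(i / 200 + 0.5, 0.999) for i in range(100)]   # shifted up
# psi(train, live_same) ≈ 0; psi(train, live_shifted) well above 0.2
```

Running this per feature on a schedule and alerting on the threshold is enough for a first drift monitor.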
What didn't work at first
- Aggressive oversampling (SMOTE) in the initial training produced an over-sensitive model. We switched to focal loss with XGBoost to handle class imbalance.
- Geolocation features had low discrimination in LatAm because of widespread VPN use and dynamic-IP mobile networks.
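To see why focal loss helps here: it scales cross-entropy by (1 − p_t)^γ, so confidently correct examples (mostly the abundant legitimate class) contribute almost nothing to the gradient. A toy comparison with illustrative α and γ; in production this was wired into XGBoost as a custom objective:

```python
import math

def cross_entropy(p, y):
    """Standard log loss for predicted fraud probability p and label y."""
    return -math.log(p if y == 1 else 1 - p)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss (Lin et al.): cross-entropy scaled by (1 - p_t)^gamma,
    shrinking the contribution of confidently correct examples."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# Easy negative (legit tx scored 0.02): almost none of the loss survives,
# so millions of easy legit examples stop swamping training.
easy = focal_loss(0.02, 0) / cross_entropy(0.02, 0)
# Hard positive (fraud scored only 0.3): most of the loss is kept.
hard = focal_loss(0.3, 1) / cross_entropy(0.3, 1)
```

Unlike SMOTE, this reweights the loss instead of synthesizing minority-class points, which avoided the over-sensitivity seen in the first iteration.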
What we would do differently
- Implement explainability from day 1 (SHAP values per transaction). Compliance needed it to justify rejections to regulators and it arrived 6 weeks late.
- Invest more in synthetic data to train against emerging fraud patterns before they show up in production.
Chart: fraud rate, pre-ML baseline vs. post-ML (90 days), for the fintech case described above. Same volume, same fraud-actor population; the ML stack replaces rules-only detection. Anonymized client metrics.
Business & commercial impact
How Numoru productizes this
Fraud-detection rollouts are one of the clearest ROI stories in a fintech — loss + chargeback reduction shows up on the P&L in the first quarter. Numoru sells it as a 3-4 month engagement with a guaranteed metric threshold, plus an MLOps retainer that keeps the model fresh against adversarial drift.
Chart: fraud-detection ticket by buyer (Numoru 2026, USD). Sources: Stripe Radar fraud-detection benchmarks; IBM, Cost of a Data Breach (fintech segment).
Fintech deploying Numoru fraud stack (12 months):

| Line item | Year-1 impact (USD) |
|---|---|
| Engagement (one-time) | −$140,000 |
| MLOps retainer (12 mo × $6k) | −$72,000 |
| Fraud loss avoided (3.1% × 2M txns × $38 × 12 mo) | +$28,272,000 |
| FP-driven conversion recovered (12 mo × $280k) | +$3,360,000 |
| Net year-1 contribution | +$31,420,000 |
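The net figure is plain arithmetic over the line items above:

```python
# Line items from the 12-month table above (USD).
engagement = -140_000
retainer = -12 * 6_000                        # 12 mo × $6k
# 3.1 pp fraud-rate reduction × 2M monthly txns × $38 avg loss × 12 mo
fraud_avoided = 0.031 * 2_000_000 * 38 * 12
conversion = 12 * 280_000                     # recovered legit volume
net = round(engagement + retainer + fraud_avoided + conversion)
# → 31,420,000, matching the net year-1 contribution row
```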
Audit:
- Baseline metrics audit
- Feature-coverage review
- Cost-of-fraud analysis
- Roadmap + quick wins

Build (the 3–4 month engagement):
- Feature pipeline + store
- XGBoost + GNN models
- Decision engine + case mgmt
- Explainability (SHAP)
- Compliance artifacts

MLOps retainer:
- Weekly retraining
- Drift monitoring
- Adversarial red-team
- Regulatory reporting
Conclusion
ML fraud detection isn't magic — it's feature engineering, careful model selection, and a decision system that combines automation with human supervision.
The most valuable component wasn't the model itself but the real-time feature infrastructure that lets us react in milliseconds to patterns no human team could catch at scale.