The problem: rising fraud in digital payments
A Latin American fintech was processing over 2 million transactions per month. Fraud accounted for 4.2% of total volume — more than double the industry benchmark (1.5–2%). The existing manual rules blocked legitimate transactions (18% false positives) while still letting sophisticated fraud patterns through.
The cost was not only financial: every fraudulent transaction generated chargebacks, eroded user trust and added regulatory risk.
The solution: real-time detection with ML
We designed and deployed a 3-layer fraud detection system:
Layer 1: Feature engineering
80% of an ML model's value lives in its features. We built 147 features in 4 categories:
- Transactional: amount, frequency, velocity between transactions, deviation from typical behavior
- Device: device fingerprint, geolocation, IP changes, User-Agent
- Behavioral: hour of day, day of week, navigation patterns before the transaction
- Network: relationships across accounts (same device, same IP, shared beneficiaries)
The network features were the most discriminative. A common fraud pattern involves multiple "clean" accounts that converge on a single beneficiary.
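A sketch of how one such network feature, fan-in per beneficiary, can be computed. The field names and the feature itself are illustrative, not the production pipeline:

```python
from collections import defaultdict

def beneficiary_fan_in(transactions):
    """Count how many distinct sender accounts converge on each
    beneficiary: the 'many clean accounts, one beneficiary' pattern."""
    senders_by_beneficiary = defaultdict(set)
    for tx in transactions:
        senders_by_beneficiary[tx["beneficiary"]].add(tx["sender"])
    return {b: len(s) for b, s in senders_by_beneficiary.items()}

txs = [
    {"sender": "acct_1", "beneficiary": "mule_9"},
    {"sender": "acct_2", "beneficiary": "mule_9"},
    {"sender": "acct_3", "beneficiary": "mule_9"},
    {"sender": "acct_4", "beneficiary": "shop_1"},
]
fan_in = beneficiary_fan_in(txs)
# a fan-in of 3 distinct senders on one beneficiary is a strong signal
```

In production this kind of aggregate is maintained incrementally rather than recomputed per transaction, but the signal is the same.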
Layer 2: Classification model
We benchmarked 4 algorithms and picked XGBoost for its balance of performance, interpretability and inference speed:
| Model | AUC-ROC | Precision | Recall | Latency (p99) |
|---|---|---|---|---|
| Logistic Regression | 0.89 | 0.82 | 0.71 | 2 ms |
| Random Forest | 0.93 | 0.88 | 0.79 | 15 ms |
| XGBoost | 0.96 | 0.92 | 0.85 | 8 ms |
| Neural Network | 0.95 | 0.90 | 0.86 | 45 ms |
The neural net matched XGBoost's recall but at over 5× the latency — unacceptable for real-time decisions.
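For reference, the precision and recall columns in the table fall out of a score threshold; a minimal sketch with toy scores and labels (label 1 = fraud):

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for a binary classifier at a given
    decision threshold on the model score."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0]
p, r = precision_recall(scores, labels, threshold=0.5)
# tp=2 (0.95, 0.80), fp=1 (0.65), fn=1 (0.40): precision 2/3, recall 2/3
```

Moving the threshold trades one metric against the other, which is exactly what the dynamic thresholds in Layer 3 exploit.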
Layer 3: Decision system
Not everything is decided by ML. The system combines:
- Deterministic rules: instant block for known patterns (confirmed stolen cards, IPs on a blocklist)
- Model score: fraud probability (0–1)
- Dynamic threshold: tuned per merchant risk segment and transaction amount
- Manual review: gray-zone transactions (score 0.4–0.7) routed to an analyst team
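Put together, the decision layer reduces to a small function. The blocklist, segments, and per-segment cutoffs below are illustrative; only the 0.4–0.7 review band comes from the text:

```python
BLOCKLISTED_IPS = {"203.0.113.7"}          # illustrative blocklist
REVIEW_BAND = (0.4, 0.7)                   # gray zone routed to analysts
REJECT_THRESHOLD = {"low_risk": 0.85, "high_risk": 0.70}  # per segment

def decide(tx, score):
    # Deterministic rules first: instant block, no model needed.
    if tx["ip"] in BLOCKLISTED_IPS or tx.get("card_reported_stolen"):
        return "reject"
    # Dynamic threshold by merchant risk segment.
    if score >= REJECT_THRESHOLD[tx["merchant_segment"]]:
        return "reject"
    if REVIEW_BAND[0] <= score <= REVIEW_BAND[1]:
        return "review"
    return "approve"

tx = {"ip": "198.51.100.1", "merchant_segment": "high_risk"}
decisions = [decide(tx, s) for s in (0.1, 0.5, 0.9)]
# → ["approve", "review", "reject"]
```

Ordering matters: hard rules fire before the model is consulted, so a confirmed stolen card is rejected even if the score looks benign.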
Production architecture
End-to-end latency under 100 ms:
```
Transaction → API Gateway → Feature Store (Redis)
                   ↓
         Feature Pipeline (Flink)
                   ↓
   XGBoost model (served with BentoML)
                   ↓
Decision Engine → Approve / Reject / Review
```
Key components:
- Feature Store on Redis: pre-computed features with 24h TTL for time-window features
- Apache Flink: real-time feature streaming (1h, 24h, 7d windows)
- BentoML: model serving with automatic batching and health checks
- PostgreSQL: decision log for audit and retraining
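The windowed velocity features that Flink maintains can be sketched in plain Python. This is a toy stand-in for the streaming job, not production code:

```python
from collections import deque

class VelocityWindow:
    """Rolling count of an account's transactions inside a time window,
    mirroring the 1h / 24h / 7d features kept in the feature store."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.timestamps = deque()

    def add(self, ts):
        self.timestamps.append(ts)
        self._evict(ts)

    def count(self, now):
        self._evict(now)
        return len(self.timestamps)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()

one_hour = VelocityWindow(3600)
for ts in (0, 100, 200, 4000):   # seconds; the first three age out by t=4000
    one_hour.add(ts)
# at t=4000 only the transaction at t=4000 remains in the 1h window
```

The 24h TTL on the Redis keys plays the same eviction role for features that are precomputed rather than streamed.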
Production results
After 4 months in production:
| Metric | Before | After | Change |
|---|---|---|---|
| Fraud rate | 4.2% | 1.1% | -73% |
| False positives | 18% | 3.2% | -82% |
| Decision time | 2–5 min (manual) | 85 ms | −99.9% |
| Monthly chargebacks | ~8,400 | ~2,200 | -74% |
ROI was reached in month two. The reduction in chargebacks plus the lift in conversion (fewer legitimate transactions blocked) generated a net positive impact of $2.1M USD annualized.
Lessons learned
What worked
- Network features were the biggest differentiator. Account-relationship patterns are harder to fake than individual features.
- Per-segment dynamic thresholds avoided a one-size-fits-all model that would have been too aggressive for low-risk merchants.
- Drift monitoring with automatic alerts whenever feature distributions shift meaningfully.
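A common way to quantify such distribution shifts is the Population Stability Index; a minimal sketch, with the bin count and the 0.2 alert cutoff as conventional, not project-specific, choices:

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between a training-time sample
    ('expected') and a live sample ('actual') of one feature.
    Rule of thumb: PSI > 0.2 signals meaningful drift."""
    def proportions(values):
        counts = [0] * bins
        width = (hi - lo) / bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # small epsilon so empty bins don't blow up the log
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]                            # uniform
live_same = [i / 100 for i in range(100)]                        # unchanged
live_shifted = [min(i / 200 + 0.5, 0.999) for i in range(100)]   # shifted up
# psi(train, live_same) ≈ 0; psi(train, live_shifted) well above 0.2
```

Running this per feature on a schedule and alerting on the threshold is enough for a first drift monitor.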
What didn't work at first
- Aggressive oversampling (SMOTE) in the initial training produced an over-sensitive model. We switched to focal loss with XGBoost to handle class imbalance.
- Geolocation features had low discrimination in LatAm because of widespread VPN use and dynamic-IP mobile networks.
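To see why focal loss helps here: it scales cross-entropy by (1 − p_t)^γ, so confidently correct examples (mostly the abundant legitimate class) contribute almost nothing to the gradient. A toy comparison with illustrative α and γ; in production this was wired into XGBoost as a custom objective:

```python
import math

def cross_entropy(p, y):
    """Standard log loss for predicted fraud probability p and label y."""
    return -math.log(p if y == 1 else 1 - p)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss (Lin et al.): cross-entropy scaled by (1 - p_t)^gamma,
    shrinking the contribution of confidently correct examples."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# Easy negative (legit tx scored 0.02): almost none of the loss survives,
# so millions of easy legit examples stop swamping training.
easy = focal_loss(0.02, 0) / cross_entropy(0.02, 0)
# Hard positive (fraud scored only 0.3): most of the loss is kept.
hard = focal_loss(0.3, 1) / cross_entropy(0.3, 1)
```

Unlike SMOTE, this reweights the loss instead of synthesizing minority-class points, which avoided the over-sensitivity seen in the first iteration.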
What we would do differently
- Implement explainability from day 1 (SHAP values per transaction). Compliance needed it to justify rejections to regulators and it arrived 6 weeks late.
- Invest more in synthetic data to train against emerging fraud patterns before they show up in production.
Chart: fraud rate, pre-ML baseline vs. post-ML (90 days), for the fintech case described above. Same volume, same fraud-actor population; the ML stack replaces rules-only detection. Anonymized client metrics.
Business & commercial impact
How Numoru productizes this
Fraud-detection rollouts are one of the clearest ROI stories in a fintech — loss + chargeback reduction shows up on the P&L in the first quarter. Numoru sells it as a 3-4 month engagement with a guaranteed metric threshold, plus an MLOps retainer that keeps the model fresh against adversarial drift.
Chart: fraud-detection ticket by buyer (Numoru 2026, USD). Sources: Stripe Radar fraud-detection benchmarks; IBM, Cost of a Data Breach (fintech segment).
Fintech deploying Numoru fraud stack (12 months):

| Line item | Year-1 impact (USD) |
|---|---|
| Engagement (one-time) | −$140,000 |
| MLOps retainer (12 mo × $6k) | −$72,000 |
| Fraud loss avoided (3.1% × 2M txns × $38 × 12 mo) | +$28,272,000 |
| FP-driven conversion recovered (12 mo × $280k) | +$3,360,000 |
| Net year-1 contribution | +$31,420,000 |
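The net figure is plain arithmetic over the line items above:

```python
# Line items from the 12-month table above (USD).
engagement = -140_000
retainer = -12 * 6_000                        # 12 mo × $6k
# 3.1 pp fraud-rate reduction × 2M monthly txns × $38 avg loss × 12 mo
fraud_avoided = 0.031 * 2_000_000 * 38 * 12
conversion = 12 * 280_000                     # recovered legit volume
net = round(engagement + retainer + fraud_avoided + conversion)
# → 31,420,000, matching the net year-1 contribution row
```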
Audit:
- Baseline metrics audit
- Feature-coverage review
- Cost-of-fraud analysis
- Roadmap + quick wins

Build (the 3–4 month engagement):
- Feature pipeline + store
- XGBoost + GNN models
- Decision engine + case mgmt
- Explainability (SHAP)
- Compliance artifacts

MLOps retainer:
- Weekly retraining
- Drift monitoring
- Adversarial red-team
- Regulatory reporting
Conclusion
ML fraud detection isn't magic — it's feature engineering, careful model selection, and a decision system that combines automation with human supervision.
The most valuable component wasn't the model itself but the real-time feature infrastructure that lets us react in milliseconds to patterns no human team could catch at scale.