# Why Not Just Use an LLM?
LLMs are powerful, but they are not a compliance engine. Here is why the architecture matters.
## A Fair Question
Why build a complex pipeline when you could just ask an LLM?
The question comes up regularly: why maintain a finely tuned hybrid pipeline — with string metrics, Machine Learning models, threshold bands, and entity guards — when you could simply pass the query name and the target dataset to a Large Language Model and ask "Are these two entities the same?"
It is a reasonable question. The answer has three parts: speed, auditability, and reproducibility.
## Head-to-Head: Hybrid vs. Pure LLM
Across the criteria that matter in production compliance.
| Criterion | Hybrid (Heuristic + ML) | Pure LLM |
|---|---|---|
| Latency | Milliseconds per match | Seconds per match |
| Cost at scale | Very low — minimal compute | Very high — per-token API costs |
| Auditability | Full — every score traceable to the exact comparison | Poor — reasoning varies, hard to document |
| Reproducibility | 100% — same input, same output | Variable — temperature and model updates affect results |
| Regulatory acceptance | High — deterministic rules satisfy BaFin, FCA, OFAC | Low — black-box reasoning is difficult to defend |
| Bulk throughput | Scales to large datasets without API rate limits or token costs | Not viable for watchlist screening at scale |
| Semantic understanding | Limited to programmed features | Excellent — world knowledge, semantic context |
| Hallucination risk | None — deterministic | Real — the model can confidently assert a false match or overlook a true one, and a missed sanctions hit is a compliance failure |
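The reproducibility row can be demonstrated directly: a deterministic string metric returns the identical score on every call, with no temperature and no model drift. A minimal sketch using Python's stdlib `difflib` as a stand-in for the pipeline's actual metrics (Jaro-Winkler itself is not in the standard library):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Deterministic string similarity in [0, 1] (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Same input, same output, every time: 1000 calls yield one distinct score.
scores = {similarity("Gazprom Neft PJSC", "GAZPROM NEFT") for _ in range(1000)}
assert len(scores) == 1
```

An LLM queried 1000 times at non-zero temperature — or across a silent model update — offers no such guarantee, which is what the auditability and reproducibility rows measure.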
## The Hybrid Approach
Deterministic rules as the foundation. Machine Learning as the ceiling.
The hybrid pipeline uses string metrics and entity guards as its floor. Machine Learning models can only raise a score — never lower it. This is a deliberate design choice: if the heuristic pipeline says "no match", that decision is mathematically traceable and documentable.
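The floor-and-ceiling rule can be sketched in a few lines. The function and variable names here are illustrative, not the pipeline's actual API:

```python
def combine(heuristic_score: float, ml_score: float) -> float:
    """ML may raise the heuristic floor, never lower it.

    The heuristic score is a guaranteed lower bound on the final score,
    so a heuristic decision can only ever be revised upward by the model,
    and the heuristic reasoning behind it stays fully traceable.
    """
    return max(heuristic_score, ml_score)

# The ML model cannot drag a strong heuristic match down:
assert combine(0.92, 0.40) == 0.92
# ...but it may rescue a borderline case the heuristics underrate:
assert combine(0.30, 0.85) == 0.85
```

Taking the maximum rather than a weighted average is what makes the heuristic layer a true floor: averaging would let a low ML score dilute a deterministic match.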
### Strengths and limitations
Strengths:
- Regulatory defensibility: Regulators including BaFin, FCA, and OFAC require explainability. When the system rejects a match, you can document exactly why — down to the character comparison.
- Mass throughput: Banks and payment processors screen millions of transactions daily in real time. Jaro-Winkler and token set operations cost microseconds. The pipeline scales horizontally without API rate limits.
- Surgical control: When a new source of false positives appears — for example, a newly generic term like "Crypto" — you add one entity guard. The fix applies immediately and completely to all future cases.
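An entity guard of the kind described in the last point can be sketched as a pre-filter that drops overly generic tokens before any comparison runs. The guard list and function names below are hypothetical, assumed for illustration:

```python
# Hypothetical guard list: tokens too generic to carry match evidence.
GENERIC_TOKENS = {"crypto", "holding", "group", "international"}

def guarded_tokens(name: str) -> set[str]:
    """Tokenize a name and strip generic tokens before comparison."""
    return {t for t in name.lower().split() if t not in GENERIC_TOKENS}

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of guarded token sets; 0.0 if nothing survives."""
    ta, tb = guarded_tokens(a), guarded_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Sharing only the word "Crypto" no longer produces a spurious hit:
assert token_overlap("Crypto Holding Group", "Acme Crypto") == 0.0
```

Adding one token to `GENERIC_TOKENS` is the "one entity guard" fix: it applies immediately and completely to every future comparison, with no retraining.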
Limitations:
- Maintenance overhead: Thresholds and weights must be monitored and tested against a ground-truth dataset. This is why nightly self-verification exists.
- Context blindness: When a company rebrands from "Twitter" to "X", every string metric fails. The system requires metadata bridges — Legal Entity Identifiers or curated alias lists — to handle complete name changes.
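A metadata bridge for complete name changes can be sketched as an alias lookup consulted before any string metric runs. The alias table and the identifier below are placeholders, not real LEIs:

```python
# Illustrative alias table keyed by a canonical identifier (e.g. an LEI).
# The key below is a placeholder, not a real LEI.
ALIASES: dict[str, set[str]] = {
    "LEI-PLACEHOLDER-001": {"twitter", "twitter inc", "x", "x corp"},
}

def same_entity_via_alias(a: str, b: str) -> bool:
    """True if both names resolve to the same canonical identifier."""
    a, b = a.lower().strip(), b.lower().strip()
    return any(a in names and b in names for names in ALIASES.values())

# "Twitter" vs "X Corp": every string metric fails, the bridge does not.
assert same_entity_via_alias("Twitter", "X Corp")
assert not same_entity_via_alias("Twitter", "Acme Ltd")
```

The bridge stays deterministic and auditable: a match via alias is documented by pointing at the curated table entry, not at a model's reasoning.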