# Why Not Just Use an LLM?
LLMs are powerful, but they are not a compliance engine. Here is why the architecture matters.
## A Fair Question
Why build a complex pipeline when you could just ask an LLM?
The question comes up regularly: why maintain a finely tuned hybrid pipeline — with string metrics, Machine Learning models, threshold bands, and entity guards — when you could simply pass the query name and the target dataset to a Large Language Model and ask "Are these two entities the same?"
It is a reasonable question. The answer has three parts: speed, auditability, and reproducibility.
## Head-to-Head: Hybrid vs. Pure LLM
Across the criteria that matter in production compliance.
| Criterion | Hybrid (Heuristic + ML) | Pure LLM |
|---|---|---|
| Latency | Milliseconds per match | Seconds per match |
| Cost at scale | Very low — minimal compute | Very high — per-token API costs |
| Auditability | Full — every score traceable to the exact comparison | Poor — reasoning varies, hard to document |
| Reproducibility | 100% — same input, same output | Variable — temperature and model updates affect results |
| Regulatory acceptance | High — deterministic rules satisfy BaFin, FCA, OFAC | Low — black-box reasoning is difficult to defend |
| Bulk throughput | Scales to large datasets without API rate limits or token costs | Not viable for watchlist screening at scale |
| Semantic understanding | Limited to programmed features | Excellent — world knowledge, semantic context |
| Hallucination risk | None — deterministic | Real — the model can confidently assert a false match or overlook a true one, and a missed sanctions hit is a compliance failure |
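The reproducibility row can be demonstrated directly: a deterministic string metric returns the identical score on every call, with no temperature and no model drift. A minimal sketch using Python's stdlib `difflib` as a stand-in for the pipeline's actual metrics (Jaro-Winkler itself is not in the standard library):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Deterministic string similarity in [0, 1] (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Same input, same output, every time: 1000 calls yield one distinct score.
scores = {similarity("Gazprom Neft PJSC", "GAZPROM NEFT") for _ in range(1000)}
assert len(scores) == 1
```

An LLM queried 1000 times at non-zero temperature — or across a silent model update — offers no such guarantee, which is what the auditability and reproducibility rows measure.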
## The Hybrid Approach
Deterministic rules as the foundation. Machine Learning as the ceiling.
The hybrid pipeline uses string metrics and entity guards as its floor. Machine Learning models can only raise a score — never lower it. This is a deliberate design choice: if the heuristic pipeline says "no match", that decision is mathematically traceable and documentable.
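The floor-and-ceiling rule can be sketched in a few lines. The function and variable names here are illustrative, not the pipeline's actual API:

```python
def combine(heuristic_score: float, ml_score: float) -> float:
    """ML may raise the heuristic floor, never lower it.

    The heuristic score is a guaranteed lower bound on the final score,
    so a heuristic decision can only ever be revised upward by the model,
    and the heuristic reasoning behind it stays fully traceable.
    """
    return max(heuristic_score, ml_score)

# The ML model cannot drag a strong heuristic match down:
assert combine(0.92, 0.40) == 0.92
# ...but it may rescue a borderline case the heuristics underrate:
assert combine(0.30, 0.85) == 0.85
```

Taking the maximum rather than a weighted average is what makes the heuristic layer a true floor: averaging would let a low ML score dilute a deterministic match.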
### Strengths and limitations
Strengths:
- Regulatory defensibility: Regulators including BaFin, FCA, and OFAC require explainability. When the system rejects a match, you can document exactly why — down to the character comparison.
- Mass throughput: Banks and payment processors screen millions of transactions daily in real time. Jaro-Winkler and token set operations cost microseconds. The pipeline scales horizontally without API rate limits.
- Surgical control: When a new source of false positives appears — for example, a newly generic term like "Crypto" — you add one entity guard. The fix applies immediately and completely to all future cases.
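An entity guard of the kind described in the last point can be sketched as a pre-filter that drops overly generic tokens before any comparison runs. The guard list and function names below are hypothetical, assumed for illustration:

```python
# Hypothetical guard list: tokens too generic to carry match evidence.
GENERIC_TOKENS = {"crypto", "holding", "group", "international"}

def guarded_tokens(name: str) -> set[str]:
    """Tokenize a name and strip generic tokens before comparison."""
    return {t for t in name.lower().split() if t not in GENERIC_TOKENS}

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of guarded token sets; 0.0 if nothing survives."""
    ta, tb = guarded_tokens(a), guarded_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Sharing only the word "Crypto" no longer produces a spurious hit:
assert token_overlap("Crypto Holding Group", "Acme Crypto") == 0.0
```

Adding one token to `GENERIC_TOKENS` is the "one entity guard" fix: it applies immediately and completely to every future comparison, with no retraining.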
Limitations:
- Maintenance overhead: Thresholds and weights must be monitored and tested against a ground-truth dataset. This is why nightly self-verification exists.
- Context blindness: When a company rebrands from "Twitter" to "X", every string metric fails. The system requires metadata bridges — Legal Entity Identifiers or curated alias lists — to handle complete name changes.
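A metadata bridge for complete name changes can be sketched as an alias lookup consulted before any string metric runs. The alias table and the identifier below are placeholders, not real LEIs:

```python
# Illustrative alias table keyed by a canonical identifier (e.g. an LEI).
# The key below is a placeholder, not a real LEI.
ALIASES: dict[str, set[str]] = {
    "LEI-PLACEHOLDER-001": {"twitter", "twitter inc", "x", "x corp"},
}

def same_entity_via_alias(a: str, b: str) -> bool:
    """True if both names resolve to the same canonical identifier."""
    a, b = a.lower().strip(), b.lower().strip()
    return any(a in names and b in names for names in ALIASES.values())

# "Twitter" vs "X Corp": every string metric fails, the bridge does not.
assert same_entity_via_alias("Twitter", "X Corp")
assert not same_entity_via_alias("Twitter", "Acme Ltd")
```

The bridge stays deterministic and auditable: a match via alias is documented by pointing at the curated table entry, not at a model's reasoning.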