Multi-Algorithm Name Matching

Beyond keyword search.
Every score traceable to the exact character comparison that produced it.

Beta version. Sanctions lists are subject to change by their issuing authorities.

Why Keyword Search Fails

The compliance gap that exact matching cannot close

Keyword search returns only exact matches, but the names on sanctions lists rarely match the names in your records exactly. "Владимир Путин" becomes "Vladimir Putin" or "Wladimir Putin" depending on the transliteration standard. "Kulazhin" and "Kulagin" are only two character edits apart, yet they are two different sanctioned persons.
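To make the two-edit gap concrete, here is a minimal sketch of Levenshtein edit distance in plain Python. It only illustrates the kind of character-level comparison described above; it is not the engine's actual metric, and the function name `edit_distance` is chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (ch_a != ch_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kulazhin", "kulagin"))    # 2: substitute z -> g, delete h
print(edit_distance("vladimir", "wladimir"))   # 1: substitute v -> w
print("kulazhin" == "kulagin")                 # False: exact matching sees no relation
```

An exact comparison reports only "different", while the distance shows how close the two spellings are. Closeness alone does not settle identity, which is why the two names above still belong to different persons.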

A compliance system that cannot handle this is not a compliance system. The names on the list are not necessarily the names in your records. The gap between them is where sanctions evasion lives.

The Multi-Layer Pipeline

Every comparison runs the same pipeline. Nothing skipped, nothing approximated.

A name query triggers twelve sequential layers, each capturing a different class of name variation. The layers interact — bonuses can only lift a score, guards can only cap it — to produce a final score from 0 to 100.
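The lift-only and cap-only rule can be sketched as two small helpers. This illustrates the stated invariant under assumptions, not the engine's code: the function names and all example numbers are hypothetical.

```python
def apply_bonus(score: float, bonus: float) -> float:
    """A bonus may only lift a score; negative bonuses are ignored."""
    return min(100.0, score + max(0.0, bonus))


def apply_guard(score: float, cap: float) -> float:
    """A guard may only cap a score; it never raises it."""
    return min(score, cap)


# Hypothetical values for illustration only.
base = 72.0
lifted = apply_bonus(base, 8.0)        # 80.0
guarded = apply_guard(lifted, 75.0)    # 75.0
final = max(0.0, min(100.0, guarded))  # clamp to the 0 to 100 range
print(final)  # 75.0
```

Keeping each helper monotonic in one direction makes the final score easy to trace: every step either raised the value, capped it, or left it unchanged.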

All Twelve Layers
  • Four String Metrics: Token Set Ratio tolerates extra or missing words. Partial Ratio catches truncations. Token Sort Ratio handles reordered name components. Character Ratio measures overall edit distance. Each captures a different class of name variation.
  • Weighted Combination: Metrics combined with tuned weights. When query and candidate differ greatly in length, weights shift automatically — preventing a short name from scoring artificially high as a substring of a long one.
  • Organisation Guards: Common-Word Guard caps scores when overlap consists only of generic terms (bank, group, holdings, international). Token-Overlap Guard caps scores when fewer than 40% of query words appear in the candidate. Both prevent false positives on generic organisation names.
  • Subset Bonus: When one name is a proper subset of the other, the score is boosted proportionally to coverage. "Putin" ⊂ "Vladimir Putin" is rewarded. "bank" ⊂ "Deutsche Bank AG" is not — it fails the Common-Word Guard.
  • Jaro-Winkler Bonus: Rewards shared prefixes. Catches transliterations that preserve the beginning of a name. Reduced for names shorter than 6 characters, where prefix matching is less discriminative.
  • Phonetic Bonus: Rewards pronunciation similarity that character-level metrics miss. Soundex captures consonant-skeleton equivalences (Mueller and Müller both map to M460). Metaphone captures finer phonetic rules (Mohammad and Muhammad). Only applied above a minimum base score to avoid boosting garbage matches.
  • Surname Boost: For person names, independently rewards matching surnames and first names — because a surname match is stronger evidence than a random string overlap.
  • Tertiary Penalty: When biographical data is available, it is compared. A mismatch in date of birth, place of birth, or nationality reduces the score, as does a gender mismatch for person entities. A matching Legal Entity Identifier suppresses all other tertiary checks, because it is definitive identity proof. An exact date-of-birth match suppresses secondary mismatches, because it is treated as identity confirmation. The combined penalty is capped to avoid over-penalising sparse data.
  • Identifier Match Bonus: When the query and the candidate share the same Legal Entity Identifier (LEI), and at least one name token overlaps, the score receives a hard-positive boost. This identifier-graph signal actively rewards confirmed identity rather than merely suppressing penalties.
  • Short-Name Cap: Single-word organisation names are capped based on character length. A three-character acronym can score no higher than 70. A seven-character name can score no higher than 95. Prevents inflated confidence on fragments.
  • Machine Learning Override: Multiple LightGBM models, one per entity type, score each match using 27 features including string metrics, Soundex and Metaphone phonetic codes, script detection, and legal suffix equivalence. Machine Learning can only raise a score, never lower it. This is a deliberate design choice: the heuristic pipeline is the floor, not the ceiling.
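As one concrete instance of the layers above, the Soundex code used by the Phonetic Bonus can be sketched in plain Python. This follows the standard American Soundex rules; stripping diacritics before encoding is an assumption, since the page does not say how the engine normalises non-ASCII letters.

```python
import unicodedata

# Standard American Soundex digit groups.
_CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        _CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name ("" if no A-Z letters)."""
    # Assumption: diacritics are stripped first, so "ü" is read as "u".
    decomposed = unicodedata.normalize("NFKD", name)
    letters = [c for c in decomposed.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = _CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        # H and W do not separate equal codes; vowels do.
        if letter not in "HW":
            previous = digit
    return (code + "000")[:4]


print(soundex("Mueller"), soundex("Müller"))  # M460 M460
```

Both spellings reduce to the same consonant skeleton, which is the equivalence the Phonetic Bonus rewards.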

Thresholds by Entity Type

A single threshold for all entity types produces noise. Separate thresholds produce precision.

Below threshold, a result is discarded. Above threshold, it appears in the review pipeline. Thresholds are not universal — generic organisation names require a higher bar to avoid noise.

Threshold Bands by Entity Type
  • Person / Unknown: Lower threshold. Names are highly variable across transliterations and jurisdictions. The engine must cast a wider net.
  • Organisation / Company / Security: Higher threshold. Generic word overlap is common. The Common-Word Guard and Token-Overlap Guard reduce noise, but a higher base threshold adds a second layer of defence.
  • Vessel / Aircraft: Intermediate threshold. Names are often distinctive but can be translated or abbreviated across registries.

Thresholds are configurable per project. The defaults are tuned against the nightly self-verification results across all active sources.
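A per-entity threshold table with project-level overrides might look like the sketch below. Every number is hypothetical: the page states only the ordering (person lowest, vessel and aircraft intermediate, organisation highest), and the names `DEFAULT_THRESHOLDS` and `passes_threshold` are invented for illustration.

```python
from typing import Dict, Optional

# Hypothetical defaults; only their relative ordering comes from the text.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "person": 70.0,
    "unknown": 70.0,
    "vessel": 75.0,
    "aircraft": 75.0,
    "organisation": 80.0,
    "company": 80.0,
    "security": 80.0,
}


def passes_threshold(score: float, entity_type: str,
                     project_overrides: Optional[Dict[str, float]] = None) -> bool:
    """Return True if a result should enter the review pipeline."""
    # Project settings take precedence over the defaults.
    thresholds = {**DEFAULT_THRESHOLDS, **(project_overrides or {})}
    # Assumption: a score exactly at the threshold is kept.
    return score >= thresholds[entity_type.lower()]


print(passes_threshold(72.0, "person"))                    # True
print(passes_threshold(72.0, "organisation"))              # False
print(passes_threshold(72.0, "person", {"person": 75.0}))  # False
```

The same score of 72 is kept for a person and discarded for an organisation, which is the effect of separating thresholds by entity type.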

Zone Classification

Results sorted by Machine Learning confidence. The highest-risk results surface first.

Results above threshold are classified into zones by Machine Learning confidence. Zone assignment determines review order — not whether a result is shown.

Zone Definitions
  • Zone A — Priority: High Machine Learning confidence — likely true positive. Review first. These are the cases that matter most.
  • Zone B — Review: Above decision threshold or strong heuristic score. Manual check recommended. Machine Learning confidence is lower or absent.
  • Zone C — Workbasket: Below both thresholds. Can be bulk-cleared with configurable auto-clear, capped at 85 to prevent accidental clearance of true positives.

The heuristic floor is a safety net. Even if the Machine Learning model is uncertain, a high heuristic score keeps a result in the review queue. Machine Learning cannot suppress a strong name match.
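The zone rules and the heuristic floor can be sketched as a single function. The three cut-off values are hypothetical, since the page does not publish them, and `classify_zone` is an illustrative name rather than the engine's API.

```python
from typing import Optional

# Hypothetical cut-offs for illustration only.
ML_PRIORITY = 0.90        # "high Machine Learning confidence"
ML_DECISION = 0.50        # decision threshold
HEURISTIC_STRONG = 90.0   # "strong heuristic score"


def classify_zone(heuristic_score: float, ml_confidence: Optional[float]) -> str:
    """Assign a review zone to a result that already passed its threshold."""
    if ml_confidence is not None and ml_confidence >= ML_PRIORITY:
        return "A"
    ml_above_decision = ml_confidence is not None and ml_confidence >= ML_DECISION
    # Heuristic floor: a strong name match stays in review even when the
    # model is uncertain or produced no confidence at all.
    if ml_above_decision or heuristic_score >= HEURISTIC_STRONG:
        return "B"
    return "C"


print(classify_zone(80.0, 0.95))  # A
print(classify_zone(95.0, 0.10))  # B
print(classify_zone(80.0, None))  # C
```

In the second call the model is unconvinced, yet the strong heuristic score keeps the result in Zone B instead of letting it fall into the workbasket.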

Multi-Layer Scoring Pipeline
4 ML Models per Entity Type
27 ML Features per Match
<50 ms per Name

Sanctions Screening Built to Be Audited.

Beta access is free and includes full screening functionality across all official sources, the complete review workflow, and audit-ready exports.
