Nightly Self-Verification
Recall proven every night. Not assumed.
Beta version. Sanctions lists are subject to change by their issuing authorities.
The Requirement
Complete recall on exact-name matches. Not a target — a requirement.
A screening system must prove that it works — not just produce results and hope. The standard is unambiguous: if a designated person cannot be found by their own official name, the system has failed.
There is no acceptable recall threshold below complete for exact-name matches. Anything less means a sanctioned actor can appear in your data and go undetected. The nightly self-verification enforces this standard automatically.
Test Design
Eight sources × six configurations × five mutation levels = 240 sessions per night.
Every entity from every active source is screened against itself under multiple conditions. The test is not a sample — it is exhaustive. Every entity. Every night.
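The nightly plan is a straightforward cross product. A minimal sketch, assuming illustrative labels — the actual source and configuration identifiers are not specified here:

```python
from itertools import product

# Hypothetical labels; the system's real source and configuration
# names are assumptions for illustration only.
SOURCES = [f"source_{i}" for i in range(1, 9)]              # eight active sources
CONFIGURATIONS = ["baseline", "no_entity_type", "no_tertiary",
                  "production", "no_ml", "cross_source"]     # six configurations
MUTATIONS = ["M0", "M1", "M2", "M3", "M4"]                   # five mutation levels

# One verification session per (source, configuration, mutation) triple.
sessions = list(product(SOURCES, CONFIGURATIONS, MUTATIONS))
print(len(sessions))  # 240
```

Within each session, every entity from that source is screened, so the entity count per night is the session plan multiplied by the full list sizes.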
Configurations and Mutation Levels
Configurations:
- Baseline: Full data — name, entity type, date of birth, nationality. Tests the system under ideal conditions.
- Without entity type: Tests resilience when type is unknown — a common real-world scenario when screening from unstructured data.
- Without tertiary data: Tests name-only matching strength. The baseline for any system that receives a name without biographical context.
- Production mode: Re-evaluates results with production DOB rescue thresholds. Tests whether entities that narrowly miss the threshold in baseline would be recovered by the DOB rescue pass used in live screening.
- Without ML: Disables the Machine Learning override entirely. Isolates the heuristic pipeline performance — the score floor without ML assistance.
- Cross-source validation: Screens each source's entities against all other sources simultaneously. Tests whether a designation on one list is detectable via a different list's entry for the same actor.
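The six configurations differ only in which inputs and pipeline stages are active, so they can be modelled as a small set of flags. A hypothetical sketch — the flag names are assumptions, not the system's actual settings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionConfig:
    """Hypothetical flags for one verification configuration."""
    use_entity_type: bool = True       # entity type available as input
    use_tertiary_data: bool = True     # DOB / nationality available
    production_dob_rescue: bool = False  # re-score with the live DOB rescue pass
    ml_override: bool = True           # Machine Learning override enabled
    cross_source: bool = False         # screen against all other sources

CONFIG_FLAGS = {
    "baseline":       SessionConfig(),
    "no_entity_type": SessionConfig(use_entity_type=False),
    "no_tertiary":    SessionConfig(use_tertiary_data=False),
    "production":     SessionConfig(production_dob_rescue=True),
    "no_ml":          SessionConfig(ml_override=False),
    "cross_source":   SessionConfig(cross_source=True),
}
```

Modelling configurations as data rather than code paths keeps the nightly run a pure loop over declarative settings.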
Mutation Levels:
- M0 — Unmodified: Original name as it appears on the list. Must achieve complete recall. Any miss at M0 is a critical failure.
- M1 — One mutation: Random character swap, omission, insertion, replacement, diacritic change, or transliteration. Models a single transcription error or variant spelling.
- M2 — Two mutations: Two mutations applied simultaneously. Tests the degradation curve — how quickly recall falls as input quality degrades.
- M3 — Name reorder: Token order is reversed or first and last tokens are swapped. Tests resilience to name component reordering — a common variant when names cross cultural conventions (given name first vs. family name first).
- M4 — Partial name: Only the first or last token is kept, simulating partial name entry. Tests whether the system can still surface a candidate when only a fragment of the full name is provided.
Every mutation is deterministically seeded. Every false negative is traceable to the exact mutation that caused it.
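A deterministically seeded character mutation (the M1/M2 style) can be sketched as below. The operation set and alphabet here are illustrative simplifications; the real generator also covers diacritic changes and transliteration:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
OPS = ("swap", "omit", "insert", "replace")

def mutate(name: str, level: int, seed: int) -> str:
    """Apply `level` random single-character mutations to `name`.

    Seeded deterministically, so any false negative is reproducible
    from its (name, level, seed) triple.
    """
    rng = random.Random(seed)
    chars = list(name)
    for _ in range(level):
        op = rng.choice(OPS)
        i = rng.randrange(len(chars))
        if op == "swap":
            j = min(i + 1, len(chars) - 1)  # swap at the final position is a no-op
            chars[i], chars[j] = chars[j], chars[i]
        elif op == "omit" and len(chars) > 1:
            del chars[i]
        elif op == "insert":
            chars.insert(i, rng.choice(ALPHABET))
        else:  # replace (also the fallback when an omit would empty the name)
            chars[i] = rng.choice(ALPHABET)
    return "".join(chars)
```

Because `random.Random(seed)` yields the same sequence on every run, replaying a session with the logged seed regenerates the exact input that caused a miss.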
What Is Measured
Five metrics per session. Every false negative analysed, not just counted.
The output of each verification session is a structured dataset — not a pass/fail flag. Every metric is preserved for trend analysis.
Metrics Collected Per Session
- True Positive Rate (Recall): Proportion of entities that find themselves as the top result. The primary metric. M0 recall must be 100%.
- Mean Reciprocal Rank: The average reciprocal of the rank at which each entity appears in its own top-5 results. A rank-1 result is ideal; rank 2 or lower flags a concern.
- Score Distribution: Mean, median, and percentiles across all sessions. Detects drift — a gradual score decline that precedes recall failures.
- False Negative Analysis: Every missed entity logged with exact mutation applied and the actual top-1 result returned. Makes failure modes visible and debuggable.
- Wilson Confidence Intervals: Statistical bounds on the true positive rate at 95% confidence. Prevents overstating reliability on small source sizes.
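The Wilson score interval itself is a standard construction. A minimal implementation at 95% confidence (z = 1.96):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (default 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))
```

Unlike the naive normal approximation, the Wilson lower bound stays strictly below 1.0 even at 100% observed recall on a small source, which is precisely why it prevents overstating reliability.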
Results are exported as a structured XLSX report after each nightly cycle.
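Given each entity's self-match rank (or a miss), the recall and MRR metrics reduce to a few lines. A sketch, assuming ranks are recorded as shown — the actual report schema is not specified here:

```python
def session_metrics(ranks):
    """Recall and mean reciprocal rank for one verification session.

    `ranks[i]` is the 1-based position at which entity i surfaced
    itself in its own top-5 results, or None for a false negative.
    """
    n = len(ranks)
    hits = [r for r in ranks if r is not None]
    recall = len(hits) / n
    # Misses contribute 0 to MRR, so the denominator stays n.
    mrr = sum(1.0 / r for r in hits) / n
    return {"recall": recall, "mrr": mrr}
```

For an M0 session the pass condition is simply `recall == 1.0`; MRR then separates clean rank-1 sessions from ones where entities surfaced but slipped below the top position.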
What the Results Mean
M0 recall is complete. M1 exceeds 99%. M2 degrades — and the report shows exactly why.
M0 recall is complete. Every entity, across all sources, is findable by its official name — every night. M1 recall (one mutation) exceeds 99%. The system tolerates one-character errors in real screening.
M2 recall degrades further — as expected. Two simultaneous mutations challenge any string similarity system, and the results document exactly which mutation types cause failures and why. That information is used to tune the matching pipeline between releases.
Sanctions Screening Built to Be Audited.
Beta access is free and includes full screening functionality across all official sources, the complete review workflow, and audit-ready exports.