Medicare claims, tax returns, PPP applications — the government already holds a closed, mostly clean dataset of every transaction it needs to find fraud. It doesn't need SARs or tips. It just needs to run the query.
A proof of concept for using AI and synthetic SAR data to estimate how many FCPA enforcement actions may have originated from FinCEN suspicious activity reports — and what that tells us about the hidden plumbing of anti-corruption enforcement.
DOJ's new FOCUS initiative wants better data-driven fraud cases. But it keeps its two best enforcement channels — whistleblower tips and data miner analytics — in separate silos. The real opportunity is connecting them.
Public data can source prosecution leads. An open-source fraud-scoring system, run against the full SBA PPP dataset, identified the same lenders, geographies, and loan populations that DOJ prosecuted — using nothing but a downloadable CSV and a standard laptop.
The PPP fraud pipeline worked because the SBA released everything. Medicare's public data is fragmented, de-identified, and missing the features detection needs. Here's what exists on GitHub, where it falls short, and what CMS would need to release to let outside analysts do for healthcare fraud what one Python repo did for PPP.
The previous post described a Medicare fraud backtest nobody had built. I built it. 289 excluded providers across 41 states, matched to pre-exclusion billing data, compared against 3.39 million peers. Thirteen of fifteen features showed statistically significant differences — and the behavioral fingerprint is consistent enough to predict fraud in providers who were never excluded.
A walkthrough of building a Medicare fraud backtest overnight in Claude Code: from a plain-English spec to 289 matched providers across 41 states, a predictive model with AUC 0.79, and out-of-sample validation. It also covers the three times the pipeline failed, the data duplication bug, and the engineering decisions that shaped the final design.