Privacy for legal documents, the parts a regex can't catch
TODO: One paragraph TL;DR. What entity classes legal documents leak that off-the-shelf NER misses, the hybrid approach that closed the gap, and the precision/recall a law firm would actually accept on a Monday morning.
Context
TODO: Why this problem exists at all. Anonymization rules in legal contexts (GDPR-style + jurisdiction-specific). The cases law firms cannot publish without manual scrubbing — and why manual scrubbing is the wrong long-term answer.
What off-the-shelf NER misses
TODO: spaCy and the popular pre-trained models do well on common entities (people, places, dates) and poorly on the long tail that actually matters in legal docs: case numbers, statute references, deposition exhibit IDs, defendant aliases. Show 2–3 concrete examples.
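As a starting point for those examples, here is a minimal sketch of what a rule for each long-tail class might look like. The three patterns are illustrative assumptions modeled on common US federal formats (a docket-style case number, a U.S.C. citation, an exhibit label); they are not the patterns used in this project, and real formats vary by court and jurisdiction. The names `PATTERNS` and `find_rule_entities` are hypothetical.

```python
import re

# Illustrative patterns only (assumed formats, not the project's rule set).
PATTERNS = {
    # e.g. "2:21-cv-01234"
    "CASE_NUMBER": re.compile(r"\b\d{1,2}:\d{2}-(?:cv|cr)-\d{3,6}\b"),
    # e.g. "18 U.S.C. § 1343" or "42 U.S.C. § 1983(a)"
    "STATUTE_REF": re.compile(r"\b\d+\s+U\.S\.C\.\s+§\s*\d+[a-z]?(?:\([a-z0-9]+\))*"),
    # e.g. "Exhibit 14-B" or "Exhibit A"
    "EXHIBIT_ID": re.compile(r"\bExhibit\s+[A-Z0-9]+(?:-[A-Z0-9]+)?\b"),
}


def find_rule_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    return sorted(hits)


sample = "In case 2:21-cv-01234, Exhibit 14-B cites 18 U.S.C. § 1343."
for start, end, label, matched in find_rule_entities(sample):
    print(f"{label:12} {matched!r} at {start}-{end}")
```

Defendant aliases are deliberately absent: they have no fixed surface form, so a lexicon or a learned component is a better fit than a regex.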
The pipeline
TODO: A diagram or block list of the stages: pre-processing, base NER (spaCy), domain-specific rule layer (regex + lexicons), reconciliation, post-processing. The interesting decision was the order, with rules running after NER, not before.
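The reconciliation step can be sketched in a few lines. The policy shown here, where a rule span always wins over any NER span it overlaps, is an assumption chosen for illustration; the outline above only fixes the order of the stages, not the conflict policy. `Span`, `reconcile`, and `redact` are hypothetical names, and the spans are hand-written stand-ins for real NER and rule output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    source: str  # "ner" or "rule"


def overlaps(a, b):
    """True if the two half-open character ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def reconcile(ner_spans, rule_spans):
    """Keep every rule span; keep an NER span only if no rule span overlaps it.

    Assumes the rule spans do not overlap each other.
    """
    kept = [s for s in ner_spans if not any(overlaps(s, r) for r in rule_spans)]
    return sorted(kept + list(rule_spans), key=lambda s: (s.start, s.end))


def redact(text, spans):
    """Replace each span with a [LABEL] placeholder, right to left so offsets stay valid."""
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:s.start] + f"[{s.label}]" + text[s.end:]
    return text


text = "Smith filed 2:21-cv-01234 in Ohio."
case_start = text.index("2:21-cv-01234")
ner = [
    Span(0, 5, "PERSON", "ner"),
    # A plausible base-model mistake: part of the case number tagged as a date.
    Span(case_start + 2, case_start + 4, "DATE", "ner"),
    Span(text.index("Ohio"), text.index("Ohio") + 4, "GPE", "ner"),
]
rules = [Span(case_start, case_start + len("2:21-cv-01234"), "CASE_NUMBER", "rule")]
print(redact(text, reconcile(ner, rules)))
```

Running the rules second means the rule layer can see and override what the model produced, which is what makes a simple "rule wins on overlap" policy possible in the first place.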
Evaluation
TODO: How you constructed the eval set (annotators, inter-rater agreement, document categories). Precision / recall / F1 numbers per entity class. Be explicit about which classes are still bad — the honest paper is the one a practitioner trusts.
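For the per-class numbers, a minimal scoring sketch under one stated assumption: a prediction counts as correct only when document, start offset, end offset, and label all match the gold annotation exactly. Partial-overlap scoring would give different figures. The function name `per_class_prf` is hypothetical and the data is a toy example, not a result from this project.

```python
from collections import Counter


def per_class_prf(gold, predicted):
    """Exact-match precision, recall, and F1 per entity label.

    gold and predicted are sets of (doc_id, start, end, label) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in predicted:
        if item in gold:
            tp[item[3]] += 1
        else:
            fp[item[3]] += 1
    for item in gold - predicted:
        fn[item[3]] += 1

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        found = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / found if found else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy annotations, invented for illustration.
gold = {("d1", 0, 5, "PERSON"), ("d1", 12, 25, "CASE_NUMBER"), ("d2", 3, 9, "CASE_NUMBER")}
predicted = {("d1", 0, 5, "PERSON"), ("d1", 12, 25, "CASE_NUMBER"), ("d2", 40, 44, "PERSON")}
for label, (p, r, f1) in per_class_prf(gold, predicted).items():
    print(f"{label:12} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Reporting the three numbers per label, not one pooled score, is what exposes the classes that are still bad.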
What I'd revisit
TODO: Two critiques. The eval set was small relative to the long tail of legal document variation; the rule layer is a maintenance burden that grows every time a new jurisdiction is added.
What this taught me about ML in production
TODO: The lesson that connects to backend work — a model that's 92% accurate sounds great until you do the math on a 10K-document monthly volume. Where rules and learned components belong, and the trade-off between auditability and capability.
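The math is short enough to write down. What "92% accurate" measures is an assumption either way, so the sketch below works through two readings: 92% of documents come out fully correct, or 92% of individual entities are caught. The second reading also assumes misses are independent and uses a hypothetical 20 sensitive entities per document. The function names are illustrative.

```python
def flawed_docs_per_month(monthly_docs, doc_accuracy):
    """Reading 1: accuracy is the share of documents that come out fully correct."""
    return round(monthly_docs * (1 - doc_accuracy))


def clean_doc_probability(entity_recall, entities_per_doc):
    """Reading 2: accuracy is per-entity recall, with misses assumed independent."""
    return entity_recall ** entities_per_doc


print(flawed_docs_per_month(10_000, 0.92))         # 800 documents with at least one error
print(f"{clean_doc_probability(0.92, 20):.1%}")    # share of documents with no missed entity
```

Under the first reading that is 800 documents a month that someone has to catch by hand. Under the second, fewer than one document in five comes out with every entity redacted, which is why a per-entity score alone says little about whether a document is safe to publish.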