Case study · Research · 2023

Privacy for legal documents, the parts a regex can't catch

Researcher, Author · 2023 · ~9 min read

Peer-reviewed · IJSRCSEIT · Dec 2023 · Hybrid pipeline · NER + rules

TODO: One paragraph TL;DR. What entity classes legal documents leak that off-the-shelf NER misses, the hybrid approach that closed the gap, and the precision/recall a law firm would actually accept on a Monday morning.

Context

TODO: Why this problem exists at all. Anonymization rules in legal contexts (GDPR-style + jurisdiction-specific). The cases law firms cannot publish without manual scrubbing — and why manual scrubbing is the wrong long-term answer.

What off-the-shelf NER misses

TODO: spaCy and the popular pre-trained models do well on common entities (people, places, dates) and poorly on the long tail that actually matters in legal docs: case numbers, statute references, deposition exhibit IDs, defendant aliases. Show 2–3 concrete examples.
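To make the long-tail classes concrete, here is a minimal sketch of what a rule for each might look like. The three patterns and the sample sentence are hypothetical and assume US-style formats; they are not the rules from the paper, and real formats vary by court and jurisdiction. Defendant aliases are left out because they usually need a lexicon rather than a regex.

```python
import re

# Hypothetical patterns for illustration only (US-style formats assumed).
PATTERNS = {
    "CASE_NUMBER": re.compile(r"\b\d{1,2}:\d{2}-(?:cv|cr)-\d{3,6}\b"),
    "STATUTE_REF": re.compile(r"\b\d+\s+U\.S\.C\.\s+§\s*\d+[a-z]?"),
    "EXHIBIT_ID": re.compile(r"\bExhibit\s+[A-Z]{1,2}-?\d*\b"),
}

def find_rule_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by start offset."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), label, m.group()))
    return sorted(hits)

# Made-up sentence, not taken from any real filing.
sample = "In case 1:21-cv-04567, plaintiff cites 42 U.S.C. § 1983 and Exhibit B-12."
for hit in find_rule_entities(sample):
    print(hit)
```

A general-purpose NER model has no particular reason to treat any of these three strings as a single sensitive entity, which is the gap the rule layer is meant to close.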

The pipeline

TODO: A diagram or block list: pre-processing, base NER (spaCy), domain-specific rule layer (regex + lexicons), reconciliation, post-processing. The interesting decision was the order: rules after NER, not before.

┌────────────┐
│  document  │
└─────┬──────┘
      ▼
┌────────────┐
│   spaCy    │  common entities
│    NER     │
└─────┬──────┘
      ▼
┌────────────┐
│ rule layer │  domain regex + lexicons
│  (NLTK +   │
│   regex)   │
└─────┬──────┘
      ▼
┌────────────┐
│ reconcile  │  overlap resolution
└─────┬──────┘
      ▼
┌────────────┐
│   output   │  masked document + audit log
└────────────┘

document → base NER → rule layer → reconciliation → masked output
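The reconcile step is the least obvious box in the flow, so here is a rough sketch of one way overlap resolution could work. The policy is an assumption, not something stated in the paper: rule spans beat NER spans, and within one source the longer span wins. `Span`, `reconcile`, `mask`, and `span_of` are hypothetical names, and the NER spans are hand-written stand-ins rather than real model output.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int
    end: int      # exclusive
    label: str
    source: str   # "ner" or "rule"

def overlaps(a, b):
    return a.start < b.end and b.start < a.end

def reconcile(ner_spans, rule_spans):
    """Merge both span lists so that no two kept spans overlap.

    Assumed policy: rule spans beat NER spans; within the same
    source the longer span wins.
    """
    priority = {"rule": 0, "ner": 1}
    candidates = sorted(
        list(ner_spans) + list(rule_spans),
        key=lambda s: (priority[s.source], -(s.end - s.start), s.start),
    )
    kept = []
    for span in candidates:
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)

def mask(text, spans):
    """Replace each span with a [LABEL] placeholder, working right to left
    so earlier offsets stay valid."""
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:s.start] + f"[{s.label}]" + text[s.end:]
    return text

def span_of(text, sub, label, source):
    """Helper for the demo: build a Span from the first occurrence of sub."""
    i = text.index(sub)
    return Span(i, i + len(sub), label, source)

text = "Smith filed case 1:21-cv-04567 on 3 May 2021."
# Stand-in NER output, including a plausible mis-tag of part of the case number.
ner = [
    span_of(text, "Smith", "PERSON", "ner"),
    span_of(text, "1:21", "CARDINAL", "ner"),
    span_of(text, "3 May 2021", "DATE", "ner"),
]
rules = [span_of(text, "1:21-cv-04567", "CASE_NUMBER", "rule")]

kept = reconcile(ner, rules)
print(mask(text, kept))  # [PERSON] filed case [CASE_NUMBER] on [DATE].
```

Running rules after NER means the rule layer can overrule a partial or mislabeled NER span, as with the `CARDINAL` fragment here, instead of competing with it blindly.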

Evaluation

TODO: How you constructed the eval set (annotators, inter-rater agreement, document categories). Precision / recall / F1 numbers per entity class. Be explicit about which classes are still bad — the honest paper is the one a practitioner trusts.
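For reference, per-class precision, recall, and F1 can be computed as sketched below. This assumes exact span matching, which is only one possible scoring choice (partial-overlap credit is another). The toy gold and predicted sets are invented to exercise the function and are not results from the paper.

```python
from collections import Counter

def per_class_prf(gold, predicted):
    """Per-label (precision, recall, F1) under exact span matching.

    gold and predicted are sets of (doc_id, start, end, label) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in predicted:
        (tp if item in gold else fp)[item[3]] += 1
    for item in gold - predicted:
        fn[item[3]] += 1

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores

# Toy annotations, made up for illustration.
gold = {
    ("d1", 0, 5, "PERSON"),
    ("d1", 17, 30, "CASE_NUMBER"),
    ("d2", 4, 16, "CASE_NUMBER"),
}
predicted = {
    ("d1", 0, 5, "PERSON"),
    ("d1", 17, 30, "CASE_NUMBER"),
    ("d2", 40, 45, "PERSON"),
}

scores = per_class_prf(gold, predicted)
for label, (p, r, f1) in scores.items():
    print(f"{label:12s} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

For anonymization, recall per class is the number to watch first: a missed entity is a leak, while a false positive is only an over-masked word.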

What I'd revisit

TODO: Two critiques. The eval set was small relative to the long tail of legal document variation; the rule layer is a maintenance burden that grows every time a new jurisdiction is added.

What this taught me about ML in production

TODO: The lesson that connects to backend work — a model that's 92% accurate sounds great until you do the math on a 10K-document monthly volume. Where rules and learned components belong, and the trade-off between auditability and capability.
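The math is short enough to write down. The first function assumes "92% accurate" means 92% of documents come out fully correct. The second assumes it is a per-entity rate with independent errors, and the figure of 20 entities per document is a made-up illustration; neither reading is confirmed by the paper.

```python
def docs_with_errors(monthly_docs, doc_accuracy):
    """Documents per month with at least one error, if accuracy is per document."""
    return round(monthly_docs * (1 - doc_accuracy))

def clean_doc_rate(entity_accuracy, entities_per_doc):
    """Share of fully clean documents, if accuracy is per entity and
    errors are independent (a simplifying assumption)."""
    return entity_accuracy ** entities_per_doc

print(docs_with_errors(10_000, 0.92))        # 800
print(f"{clean_doc_rate(0.92, 20):.1%}")     # share of documents with zero errors
```

Under the per-document reading, that is 800 documents a month someone still has to catch by hand. Under the per-entity reading, most documents contain at least one error.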