Parsing Multi-Format FOIA Submissions with Python Regex

This task lives inside Email & Form Parsing Pipelines: before a request can be scored, routed, or deadline-stamped, the raw submission has to be turned into structured fields, and regex is the deterministic extraction layer that does it. Here you build that extractor for the three formats public-records inboxes actually receive — threaded email, legacy web-form exports, and OCR’d PDFs — without letting it leak forwarded data or stall a worker.

Scenario & Compliance Stakes

Public-records intake almost never arrives as clean structured data. The same week an agency might receive a FOIA request as a forwarded email chain with three layers of > quoting, a CSV-style dump from a decommissioned web form, and a scanned PDF whose OCR pass inserted zero-width spaces and broke lines mid-sentence. Each format hides the same four statutory elements — who is asking, what records they want, and whether they claim a fee waiver or expedited processing under 5 U.S.C. § 552(a)(3) and § 552(a)(6)(E). A naive re.search over that mess routinely grabs the wrong contact: the agency staffer who forwarded the message, not the citizen who filed it.

That misextraction is not cosmetic. The requester’s email is what stamps the 20-business-day response clock under 5 U.S.C. § 552(a)(6)(A)(i) and what every later acknowledgement and appeal is addressed to. Capture the forwarding clerk’s @agency.gov address instead and the statutory clock anchors to the wrong party, the requester never receives an acknowledgement, and the agency runs silently past its deadline — a constructive denial the requester can appeal or litigate. A second failure mode is operational: a greedy, backtracking pattern fed a multi-megabyte OCR dump can pin a CPU core for seconds (a ReDoS stall), starving the intake workers and pushing every queued request closer to its deadline. The extractor below is built to fail closed on both: it anchors to the real requester and it cannot catastrophically backtrack.

Prerequisites

Python 3.11+ — for re.VERBOSE/re.MULTILINE flags and modern typing syntax (the standard-library re module is all you need; no third-party regex engine).
A normalization assumption — input is decoded to str (UTF-8) before it reaches the extractor; binary attachments, base64 blobs, and MIME parts are stripped upstream in the parsing pipeline.
A maintained internal-domain allowlist — the set of agency mail domains (.gov, .mil, and any contractor domains) used to recognise and discard routing noise.
Write access to a structured audit log sink — stdout JSON in development, forwarded to your SIEM or append-only store in production; never log raw PII.
A bounded-execution wrapper — a way to cap per-call regex CPU time (a worker timeout, signal.alarm, or a subprocess budget) so one pathological payload cannot stall the pool.

Implementation

The module compiles three verbose, explicitly bounded patterns — every quantifier has an upper limit, which is what makes the patterns ReDoS-safe — and wraps extraction in a normaliser and an internal-domain filter. The numbered comments mark the compliance-critical lines.

python

import re
import logging
from typing import Optional, TypedDict

# Structured audit logging: every extraction must be reconstructable on appeal.
# Never log raw PII — emails are hashed and phones redacted before they are written.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("foia.intake.regex_engine")

# (1) Agency-owned domains. A match resolving here is routing noise, not a requester.
INTERNAL_DOMAINS = (".gov", ".mil")

# (2) Every quantifier is upper-bounded ({2,50}, {10,5000}) so the engine cannot
#     catastrophically backtrack on a large OCR dump — the ReDoS guard is structural.
REQUESTER_PATTERN = re.compile(
    r"""
    (?:^|\n)\s*
    (?:Requester|Submitted\sby|From|Contact|Applicant)\s*:\s*
    (?P<name>[A-Z][A-Za-z\s.\-]{2,50}?)
    [\s,|]+
    (?P<email>[A-Za-z0-9._%+\-]{1,64}@[A-Za-z0-9.\-]{1,255}\.[A-Za-z]{2,24})
    (?:[\s|]+
        (?P<phone>(?:\+?1[\s.\-]?)?\(?\d{3}\)?[\s.\-]?\d{3}[\s.\-]?\d{4})
    )?
    """,
    re.VERBOSE | re.IGNORECASE | re.MULTILINE,
)

DESCRIPTION_PATTERN = re.compile(
    r"""
    (?:^|\n)\s*
    (?:Description|Records\sRequested|Subject|Scope)\s*:\s*
    (?P<description>[^\n]{10,5000})
    """,
    re.VERBOSE | re.IGNORECASE | re.MULTILINE,
)

# (3) 5 U.S.C. § 552(a)(3) (fee waiver) and § 552(a)(6)(E) (expedited processing):
#     both flags drive downstream statutory handling, so detection is deterministic.
FEE_EXPEDITE_PATTERN = re.compile(
    r"""
    (?:^|\n)\s*
    (?:
        (?P<fee_waiver>fee\swaiver|waiver\sof\sfees|public\sinterest\swaiver)
        |
        (?P<expedite>expedited\sprocessing|urgency\sto\sinform|imminent\sthreat)
    )
    """,
    re.VERBOSE | re.IGNORECASE | re.MULTILINE,
)


class FoiaFields(TypedDict):
    requester: Optional[str]
    email: Optional[str]
    phone: Optional[str]
    description: Optional[str]
    fee_waiver: bool
    expedite: bool


def _hash_email(email: str) -> str:
    """Stable, non-reversible token for audit logs — keeps PII out of the log stream."""
    import hashlib
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def _normalize(raw_text: str) -> str:
    # (4) Strip zero-width chars and normalise CRLF/whitespace from OCR and email.
    text = raw_text.replace("", "").replace("", "").replace("\r\n", "\n")
    return re.sub(r"[ \t]{2,}", " ", text)


def extract_foia_fields(raw_text: str) -> FoiaFields:
    """Deterministically extract the four statutory elements from one submission."""
    if not isinstance(raw_text, str):
        # (5) Fail closed: a non-string payload must halt, never silently return blanks.
        raise TypeError("Intake payload must be a decoded str")

    normalized = _normalize(raw_text)
    fields: FoiaFields = {
        "requester": None, "email": None, "phone": None,
        "description": None, "fee_waiver": False, "expedite": False,
    }

    # (6) Iterate candidates and skip internal addresses so the requester clock anchors
    #     to the citizen, not the forwarding clerk — § 552(a)(6)(A)(i) depends on this.
    for m in REQUESTER_PATTERN.finditer(normalized):
        email = m.group("email")
        if email.lower().endswith(INTERNAL_DOMAINS):
            logger.info("routing_noise_skipped email_hash=%s pos=%d",
                        _hash_email(email), m.start())
            continue
        fields["requester"] = m.group("name").strip()
        fields["email"] = email
        fields["phone"] = "REDACTED" if m.group("phone") else None
        logger.info("requester_extracted email_hash=%s has_phone=%s",
                    _hash_email(email), bool(m.group("phone")))
        break

    desc = DESCRIPTION_PATTERN.search(normalized)
    if desc:
        fields["description"] = desc.group("description").strip()

    fee = FEE_EXPEDITE_PATTERN.search(normalized)
    if fee:
        fields["fee_waiver"] = bool(fee.group("fee_waiver"))
        fields["expedite"] = bool(fee.group("expedite"))

    # (7) Mandatory-field check: a request with no requester email cannot be clocked.
    if not fields["email"] or not fields["description"]:
        logger.warning("incomplete_extraction missing_email=%s missing_desc=%s",
                       fields["email"] is None, fields["description"] is None)

    return fields

The validated fields dict is the contract the rest of intake depends on: the fee_waiver and expedite booleans feed Priority Scoring Algorithms, the description is tokenised by Department Routing Logic to pick a custodial office, and the whole payload is serialised onto Async Queue Management for downstream processing. When the source is a scanned PDF, run the bytes through OCR Processing Pipelines first so this extractor receives text, not an image.

Expected Output & Verification

Feeding a forwarded email whose top quote is an agency clerk and whose body holds the real request produces a clean, PII-free audit trail and the right fields:

text

2026-06-27 09:14:02,118 | INFO | foia.intake.regex_engine | routing_noise_skipped email_hash=4f1a9c2b7e03 pos=12
2026-06-27 09:14:02,121 | INFO | foia.intake.regex_engine | requester_extracted email_hash=9d3e7a01ff52 has_phone=True

python

sample = (
    "From: Records Desk, desk@city.gov\n"
    "Requester: Jane Q. Public, jane.public@example.com | 555-123-4567\n"
    "Records Requested: All inspection reports for 100 Main St, 2023-2024\n"
    "Fee waiver requested in the public interest.\n"
)
fields = extract_foia_fields(sample)
assert fields["email"] == "jane.public@example.com"   # not desk@city.gov
assert fields["fee_waiver"] is True
assert fields["phone"] == "REDACTED"                  # PII never surfaced raw

The load-bearing assertion is the first one: the .gov address is skipped and the citizen’s address wins, so the response clock anchors to the right party. For regression safety, replay a corpus of historical submissions through extract_foia_fields in a dry run and diff extracted emails against the known-correct requester before any production deploy.

Common Pitfalls

ReDoS on a large OCR dump. Python’s re has no possessive quantifiers or atomic groups, so an unbounded pattern like (\s+\w+)+ over a multi-megabyte scan can backtrack for seconds and pin a worker. Keep every quantifier upper-bounded ({2,50}, {10,5000}) as in the patterns above, and still wrap each call in a CPU-time budget so a pathological payload trips a timeout instead of starving the pool.
Anchoring to forwarded-chain contacts. A bare REQUESTER_PATTERN.search() returns the first match, which on a forwarded email is usually the agency clerk who relayed it. Use finditer plus the internal-domain skip (step 6) so the citizen’s address is the one that stamps the deadline — getting this wrong is a silent missed-deadline bug, not a crash.
OCR whitespace and zero-width characters breaking matches. Scanned PDFs routinely carry / and doubled spaces that split jane.public @ example.com or wreck the : delimiter. Normalise before matching (step 4); skipping normalisation produces intermittent “no requester found” failures that only show up on scanned batches and are miserable to reproduce.

FAQ

Why regex instead of an LLM or a dedicated email parser for this?

Because intake field extraction is a compliance control, and a control has to be deterministic and auditable. The same submission must always yield the same email, description, and flags, and you must be able to point an Inspector General at the exact pattern version and match position that produced a field. Bounded regex gives you that reproducibility and a fixed, provable execution cost; a probabilistic model gives you neither. Where a layout is genuinely unstructured — a scanned form with no labels — that is an OCR/layout problem to solve upstream, not a reason to make the field extractor non-deterministic.

How do I keep the patterns correct when a web form or statute changes wording?

Version the compiled patterns the same way you version any compliance config. Keep a registry that pins each pattern to a version string, deploy a new version alongside the old during a transition window, and tag every extraction log line with the pattern version that produced it. When a fee-waiver phrasing or a form label changes, you add a new alternation branch and bump the version rather than editing in place — so a request parsed last year can still be reproduced against the pattern that was live when it arrived.

What should happen when a mandatory field cannot be extracted?

Fail closed and escalate to a human, never silently drop the request. The extractor logs incomplete_extraction (step 7) when email or description is missing and passes the payload through to a human-in-the-loop review queue with its original text intact. A request the parser cannot understand is still a request under 5 U.S.C. § 552(a)(3), and the statutory clock starts when the agency receives it — so an unparseable submission must be triaged, not discarded.

Email & Form Parsing Pipelines — the parent stage these extracted fields feed into.
Priority Scoring Algorithms — consumes the fee_waiver and expedite flags to escalate urgent requests.
Department Routing Logic — tokenises the extracted description to pick a custodial office.
OCR Processing Pipelines — produces the text this extractor reads from scanned PDFs.
Intake & Routing Workflows — the system this extraction step belongs to.

Parsing Multi-Format FOIA Submissions with Python Regex #

Scenario & Compliance Stakes #

Prerequisites #

Implementation #

Expected Output & Verification #

Common Pitfalls #

FAQ #

Related #