FOIA Request Taxonomy Design: A Deterministic Classification Model

Within Core Architecture & Compliance Mapping, the request taxonomy is the authoritative classification layer that turns unstructured public-records intake into validated, machine-readable records that every downstream stage can trust. It is the first deterministic decision in the pipeline: the moment a request is classified, its statutory response clock, its exemption posture, its retention schedule, and its security boundary are all implied. Get the taxonomy wrong and every later control inherits the error — a misclassified record class can start the wrong deadline under 5 U.S.C. § 552(a)(6)(A)(i), route privileged material past redaction, or trigger premature destruction of records under legal hold. This guide builds the taxonomy as a versioned configuration artifact, enforces it with Pydantic v2 at the API boundary, and shows how a validated classification deterministically drives routing, retention, and audit.

Problem Framing & Statutory Requirement

Public-records intake arrives as free text: a citizen email, a portal form, a faxed letter transcribed by a clerk. Free text cannot be reasoned about deterministically, yet the statute demands deterministic action. The federal 20-business-day response window of 5 U.S.C. § 552(a)(6)(A)(i) begins on receipt, and tolling, fee, and exemption rules all hinge on what kind of request this is. A taxonomy that is ambiguous, unversioned, or applied by human judgment at intake produces three failure classes that surface months later under administrative appeal or litigation:

Clock errors. If the subject-matter domain or record class is assigned inconsistently, the state law compliance frameworks rule engine computes the deadline against the wrong convention, and an agency misses a statutory deadline it believed it had met.
Disclosure errors. Exemption codes that are free-typed (b6, B-6, privacy) cannot be matched against a rule matrix, so privileged or personal material slips past the redaction stage.
Retention errors. If the record class does not map cleanly to a disposition schedule, the records retention scheduling engine cannot calculate a lawful destruction date, risking either premature destruction or unlawful over-retention.

The requirement, therefore, is a taxonomy that is closed (every value drawn from a controlled vocabulary), versioned (every classification stamped with the schema version that produced it), and enforced at the boundary (malformed payloads rejected before they enter the pipeline). The remainder of this page implements exactly that.

Prerequisites & Environment Setup

The taxonomy engine is pure Python with a single third-party dependency for validation. It is intentionally lightweight so it can run inline in the intake API request path without adding meaningful latency.

Python 3.11+: required for datetime.UTC, StrEnum, and the typing features used below.
pydantic>=2.6: schema enforcement, custom validators, and model_config for closed (extra="forbid"), immutable (frozen=True) models. Install with pip install "pydantic>=2.6".
Standard library: enum, uuid, json, logging, datetime. No other runtime dependencies.
pytest>=8.0 (dev only): for the regression suite in the verification section.
Access controls: the intake service needs write access only to the append-only audit log and the routing queue; it must hold no credentials for the records store itself. Classification happens before retrieval, so least-privilege separation is enforced through security boundary configuration at this hop.

Treat the taxonomy definition (the controlled vocabularies below) as a configuration artifact under version control, not as inline constants buried in application code. Tag each change with a semantic version and an effective date so that a classification made last quarter can always be re-evaluated against the schema that was authoritative at the time.
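One way to make that re-evaluation concrete is a small registry that resolves which schema version was in force on a given date. The sketch below is illustrative only: `TaxonomyVersion`, `REGISTRY`, `version_in_force`, the `2026.03.0` entry, and both effective dates are hypothetical. Only the `2026.06.0` version string and the record-class names come from this page.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TaxonomyVersion:
    version: str                    # version tag stamped on every record
    effective: date                 # first day this vocabulary is authoritative
    record_classes: frozenset[str]  # one controlled vocabulary, for brevity


# Hypothetical history, kept sorted by effective date (oldest first).
REGISTRY: tuple[TaxonomyVersion, ...] = (
    TaxonomyVersion(
        "2026.03.0",
        date(2026, 3, 1),
        frozenset({"email_records", "contracts", "policy_documents", "financial_audits"}),
    ),
    TaxonomyVersion(
        "2026.06.0",
        date(2026, 6, 1),
        frozenset({"email_records", "contracts", "policy_documents",
                   "financial_audits", "multimedia_records"}),
    ),
)


def version_in_force(on: date) -> TaxonomyVersion:
    """Return the latest version whose effective date is on or before `on`."""
    idx = bisect_right([v.effective for v in REGISTRY], on)
    if idx == 0:
        raise LookupError(f"no taxonomy version in force on {on.isoformat()}")
    return REGISTRY[idx - 1]
```

Under these assumed dates, a record classified in mid-April resolves to the older vocabulary, so a class that was added later is correctly treated as unknown for that record.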

Architecture Overview

A production taxonomy enforces strict hierarchical constraints across five immutable layers. Each layer is a controlled vocabulary with machine-readable identifiers, and a valid classification is one path through all five. Free-text categorization at intake is eliminated: the intake form presents API-validated dropdowns that map directly to predefined nodes.

Request Origin & Intake Channel — portal, email, mail, api_gateway
Subject Matter Domain — procurement, personnel, environmental, law_enforcement, infrastructure
Record Class & Media Format — email_records, contracts, policy_documents, financial_audits, multimedia_records
Statutory Exemption Codes — b1 through b9 federal equivalents, or jurisdiction-specific codes such as state_12c
Workflow State — received, scoped, searching, reviewing, redacting, released, appealed

The classification schema must align with jurisdictional mandates. A request tagged under a deliberative-process exemption requires different routing logic, redaction templates, and statutory response clocks than one tagged under a privacy exemption. This structural alignment ensures downstream processors interact with the compliance engine without performing manual statutory interpretation at the routing layer. The validated payload produced here is the same canonical record that the department routing logic stage consumes, so the taxonomy contract is effectively the API contract between intake and routing.
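The idea of a classification as one path through all five layers can be sketched without any dependencies. This is a simplified illustration, not the enforcement mechanism used in the implementation section: `LAYERS`, `EXEMPTION_RE`, and `invalid_layers` are hypothetical names, and the state-code pattern (`state_` plus lowercase letters or digits) is an assumption about how jurisdiction codes are spelled.

```python
from __future__ import annotations

import re

# Closed vocabularies copied from the layer list above.
LAYERS: dict[str, frozenset[str]] = {
    "intake_channel": frozenset({"portal", "email", "mail", "api_gateway"}),
    "subject_domain": frozenset(
        {"procurement", "personnel", "environmental", "law_enforcement", "infrastructure"}
    ),
    "record_class": frozenset(
        {"email_records", "contracts", "policy_documents", "financial_audits", "multimedia_records"}
    ),
    "workflow_state": frozenset(
        {"received", "scoped", "searching", "reviewing", "redacting", "released", "appealed"}
    ),
}

# Exemption codes follow a pattern rather than a fixed set: b1-b9, or a namespaced state code.
# Assumption: the state suffix is one or more lowercase letters or digits.
EXEMPTION_RE = re.compile(r"b[1-9]|state_[a-z0-9]+")


def invalid_layers(classification: dict[str, object]) -> list[str]:
    """Return the names of the layers whose value falls outside its vocabulary."""
    bad = [
        name
        for name, vocab in LAYERS.items()
        if not (isinstance(classification.get(name), str) and classification[name] in vocab)
    ]
    codes = classification.get("exemption_codes", [])
    codes_ok = isinstance(codes, list) and all(
        isinstance(code, str) and EXEMPTION_RE.fullmatch(code) for code in codes
    )
    if not codes_ok:
        bad.append("exemption_codes")
    return bad
```

An empty result means the classification is a valid path. Any non-empty result names exactly the layers an intake form would need to re-prompt for.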

Step-by-Step Implementation

1. Define the controlled vocabularies and the validated payload

Model each taxonomy layer as a string enum so that any value outside the closed set is rejected by Pydantic before custom logic runs. The payload model uses extra="forbid" to reject unexpected fields — a common vector for malformed or probing intake submissions — and stamps every record with the schema version that classified it.

python

from __future__ import annotations

import json
import logging
import uuid
from datetime import datetime, UTC
from enum import StrEnum

from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator

# --- Structured JSON audit logging (NIST SP 800-53 AU-3: content of audit records) ---
class JSONLogFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(UTC).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach any structured fields passed via logger extra=...
        for key in ("request_id", "schema_version", "routing", "error_details"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("foia_taxonomy_engine")
logger.setLevel(logging.INFO)
_handler = logging.StreamHandler()
_handler.setFormatter(JSONLogFormatter())
logger.addHandler(_handler)

TAXONOMY_SCHEMA_VERSION = "2026.06.0"  # bump on any vocabulary change; stamp every record

class RequestChannel(StrEnum):
    PORTAL = "portal"
    EMAIL = "email"
    MAIL = "mail"
    API_GATEWAY = "api_gateway"

class SubjectDomain(StrEnum):
    PROCUREMENT = "procurement"
    PERSONNEL = "personnel"
    ENVIRONMENTAL = "environmental"
    LAW_ENFORCEMENT = "law_enforcement"
    INFRASTRUCTURE = "infrastructure"

class RecordClass(StrEnum):
    EMAIL = "email_records"
    CONTRACT = "contracts"
    POLICY = "policy_documents"
    FINANCIAL = "financial_audits"
    MULTIMEDIA = "multimedia_records"

class WorkflowState(StrEnum):
    RECEIVED = "received"
    SCOPED = "scoped"
    SEARCHING = "searching"
    REVIEWING = "reviewing"
    REDACTING = "redacting"
    RELEASED = "released"
    APPEALED = "appealed"

class FOIATaxonomyPayload(BaseModel):
    # Reject unknown fields outright; closed vocabularies enforce the rest.
    model_config = ConfigDict(extra="forbid", frozen=True)

    request_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    schema_version: str = TAXONOMY_SCHEMA_VERSION
    intake_channel: RequestChannel
    subject_domain: SubjectDomain
    record_class: RecordClass
    exemption_codes: list[str] = Field(default_factory=list)
    workflow_state: WorkflowState = WorkflowState.RECEIVED
    # 5 U.S.C. § 552(a)(6)(A)(i): the 20-business-day clock starts at receipt.
    submitted_at: datetime = Field(default_factory=lambda: datetime.now(UTC))

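    @field_validator("exemption_codes", mode="before")
    @classmethod
    def validate_exemption_format(cls, value: list[str] | None) -> list[str]:
        if not value:
            return []
        if not isinstance(value, (list, tuple)) or not all(isinstance(c, str) for c in value):
            # Raise ValueError (not AttributeError) so Pydantic reports a ValidationError.
            raise ValueError("exemption_codes must be a list of strings")
        normalized = [code.lower().strip() for code in value]
        for code in normalized:
            # Federal b1-b9 or namespaced state code (e.g. "state_12c"); nothing else.
            # An explicit digit set rejects "b0" and non-ASCII digits that isdigit() accepts.
            is_federal = len(code) == 2 and code[0] == "b" and code[1] in "123456789"
            is_state = code.startswith("state_") and len(code) > len("state_")
            if not (is_federal or is_state):
                raise ValueError(f"Invalid exemption code format: {code!r}")
        return normalized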

Expected behavior: a payload with record_class="contract" (singular, not in the vocabulary) raises ValidationError immediately; a payload with exemption_codes=["B-6"] is rejected by the validator with a precise message naming the offending value.

2. Derive deterministic routing from the classification

Routing is a pure function of the taxonomy — no I/O, no heuristics — so the same input always produces the same target queue and priority. This determinism is what makes the routing decision reproducible during an audit years later.

python

def route(self) -> dict[str, str]:
    """Deterministic routing derived solely from the validated taxonomy."""
    routing_map = {
        RecordClass.FINANCIAL: "compliance_finance_queue",
        RecordClass.CONTRACT: "procurement_legal_queue",
        RecordClass.EMAIL: "records_search_engine",
        RecordClass.POLICY: "policy_review_board",
        RecordClass.MULTIMEDIA: "media_redaction_unit",
    }
    # b6 (personal privacy) / b7 (law enforcement) require elevated handling.
    sensitive = {"b6", "b7"} & set(self.exemption_codes)
    return {
        "request_id": self.request_id,
        "target_queue": routing_map[self.record_class],
        "priority": "high" if sensitive else "standard",
        # Statutory clock anchor handed to the deadline engine downstream.
        "statutory_clock_start": self.submitted_at.isoformat(),
    }

This route method is added to the FOIATaxonomyPayload class. The validated payload is the contract handed to priority scoring algorithms, which layer requester history and backlog pressure on top of this baseline priority.
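The statutory_clock_start value returned by route is the anchor a downstream deadline engine would count from. As a rough sketch of that calculation, the function below counts 20 working days forward from the receipt date. Several things here are assumptions rather than part of this page's engine: the name `response_due_date`, counting from the day after receipt, taking the calendar date directly from the ISO timestamp without converting to agency local time, and a caller-supplied holiday set. Tolling and extensions are not modeled.

```python
from __future__ import annotations

from datetime import date, datetime, timedelta


def response_due_date(
    clock_start: str,
    holidays: frozenset[date] = frozenset(),
    working_days: int = 20,
) -> date:
    """Count `working_days` forward from the day after receipt.

    Saturdays, Sundays, and any date in `holidays` are skipped.
    `clock_start` is an ISO 8601 timestamp such as route()["statutory_clock_start"].
    """
    current = datetime.fromisoformat(clock_start).date()
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        # weekday() is 0-4 for Monday-Friday.
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current
```

Keeping the holiday calendar as an input rather than a constant mirrors the configuration-artifact approach used for the vocabularies: the calendar can be versioned and audited separately from the counting logic.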

3. Process intake at the API boundary with fail-fast auditing

The boundary function is the only entry point. It validates, emits a structured audit event on both success and failure, and re-raises so the API layer can return a precise 422 rather than a generic 400.

python

def process_intake_payload(raw_data: dict) -> FOIATaxonomyPayload:
    try:
        validated = FOIATaxonomyPayload(**raw_data)
    except ValidationError as exc:
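        # AU-3 / AU-9: record the rejection without leaking the raw payload.
        # include_input=False keeps submitted values out of the log; include_context=False
        # drops the exception objects under "ctx", which json.dumps cannot serialize.
        logger.error(
            "Taxonomy validation failed",
            extra={
                "error_details": exc.errors(include_url=False, include_context=False, include_input=False),
                "schema_version": TAXONOMY_SCHEMA_VERSION,
            },
        )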
        raise

    routing = validated.route()
    logger.info(
        "Taxonomy validated",
        extra={
            "request_id": validated.request_id,
            "schema_version": validated.schema_version,
            "routing": routing,
        },
    )
    return validated

Expected output: a successful call emits a single JSON log line such as {"timestamp": "...", "level": "INFO", "logger": "foia_taxonomy_engine", "message": "Taxonomy validated", "request_id": "...", "schema_version": "2026.06.0", "routing": {"request_id": "...", "target_queue": "procurement_legal_queue", "priority": "standard", "statutory_clock_start": "..."}} and returns an immutable payload object ready for the routing queue.

Validation & Verification

A taxonomy is only trustworthy if its behavior is asserted by tests that run on every change. Because the vocabularies are configuration, the highest-value tests are those that catch a vocabulary drift before it ships.

python

import pytest
from pydantic import ValidationError

# Assumes process_intake_payload is importable from the module that defines the engine above.

VALID = {
    "intake_channel": "portal",
    "subject_domain": "procurement",
    "record_class": "contracts",
    "exemption_codes": ["b5"],
}

def test_valid_payload_routes_deterministically():
    a = process_intake_payload(dict(VALID))
    b = process_intake_payload(dict(VALID))
    # Routing is a pure function of the taxonomy: same class -> same queue.
    assert a.route()["target_queue"] == b.route()["target_queue"] == "procurement_legal_queue"

def test_unknown_field_is_rejected():
    bad = dict(VALID, requester_ssn="000-00-0000")  # extra="forbid" blocks PII smuggling
    with pytest.raises(ValidationError):
        process_intake_payload(bad)

@pytest.mark.parametrize("code", ["B-6", "privacy", "552(b)(3)", "b99"])
def test_malformed_exemption_codes_rejected(code):
    with pytest.raises(ValidationError):
        process_intake_payload(dict(VALID, exemption_codes=[code]))

def test_sensitive_exemption_raises_priority():
    payload = process_intake_payload(dict(VALID, exemption_codes=["b6"]))
    assert payload.route()["priority"] == "high"

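To assert the audit trail itself, capture log records with pytest's caplog fixture and confirm that every accepted payload emits exactly one Taxonomy validated record carrying a request_id and a schema_version. This is the idempotency-and-traceability check that an auditor will expect to see exercised. Because FOIATaxonomyPayload is frozen=True, a validated record cannot be mutated after classification, and re-validating a canonical record that already carries its request_id and submitted_at yields an equal object. Replay during recovery therefore never produces a second classification or a second statutory clock, provided the replay feeds back the stored record rather than the raw intake, which would draw a fresh request_id and timestamp from the field defaults.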
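The shape of that audit-trail assertion can be sketched with the standard library alone. In the sketch below, `ListHandler`, `fake_intake`, and `validated_records` are hypothetical names, and `fake_intake` is a stand-in for process_intake_payload so the example runs without Pydantic. In a pytest suite, `caplog.records` exposes the same LogRecord objects that the in-memory handler collects here.

```python
import logging
import uuid


class ListHandler(logging.Handler):
    """Collect emitted records in memory so a test can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def fake_intake(logger: logging.Logger) -> str:
    """Hypothetical stand-in for process_intake_payload: logs one success event."""
    request_id = str(uuid.uuid4())
    logger.info(
        "Taxonomy validated",
        extra={"request_id": request_id, "schema_version": "2026.06.0"},
    )
    return request_id


def validated_records(handler: ListHandler, request_id: str) -> list[logging.LogRecord]:
    """Return the success records that carry the given request_id."""
    return [
        r
        for r in handler.records
        if r.getMessage() == "Taxonomy validated"
        and getattr(r, "request_id", None) == request_id
    ]


logger = logging.getLogger("foia_taxonomy_engine")
logger.setLevel(logging.INFO)
handler = ListHandler()
logger.addHandler(handler)

request_id = fake_intake(logger)
matches = validated_records(handler, request_id)
```

The check an auditor cares about is then that `matches` holds exactly one record and that the record exposes both `request_id` and `schema_version` as attributes.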

Troubleshooting & Edge Cases

Duplicate submissions across channels. A requester emails a request and then re-submits the identical text through the portal, producing two request_id values and two statutory clocks for one logical request. Diagnosis: identical normalized subject text within a short window. Fix: derive a deterministic deduplication key from a hash of the normalized requester identity plus request body, and reconcile duplicates before assigning the clock rather than treating each channel as a distinct request.
Exemption codes with non-ASCII or smart-quote artifacts. Mail submissions transcribed from scanned letters often carry b5 followed by a non-breaking space or a Unicode dash inside a state code. Diagnosis: the validator rejects a code that looks correct to a human reviewer. Fix: normalize with unicodedata.normalize("NFKC", code) and strip zero-width characters before the format check, so transcription noise does not block a legitimate classification. NFKC folds a non-breaking space into a plain space but leaves en and em dashes unchanged, so map those explicitly if they occur.
Vocabulary drift after a legislative update. A new state exemption is created, but the controlled vocabulary still rejects it, so valid requests fail at intake. Diagnosis: a spike of state_* rejections in the audit log against the current schema_version. Fix: add the code, bump TAXONOMY_SCHEMA_VERSION, and keep the prior version available so historical records re-validate against the schema that classified them — never mutate old classifications in place.
Litigation-hold conflict at classification. A record class maps to a short disposition schedule while the underlying records are under an active legal hold. Diagnosis: a retention engine attempts to compute a destruction date for a held record. Fix: treat hold status as a gate evaluated after classification but before any disposition action; the taxonomy assigns the class, but the records retention scheduling engine must refuse destruction while a hold flag is set.
Overbroad requests that span multiple domains. A single request names procurement, personnel, and infrastructure records at once, defeating a single-domain classification. Diagnosis: the intake form forces one subject_domain, but the request genuinely spans several. Fix: detect the multi-domain signal at intake and emit a programmatic clarification request before initiating a costly enterprise-wide search, rather than forcing a lossy single-domain tag.
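Two of the fixes above, exemption-code normalization and the deduplication key, can be sketched with the standard library. The function names `clean_exemption_code` and `dedup_key` are hypothetical, and so are the choices of SHA-256, case-insensitive matching, whitespace collapsing, and the unit-separator delimiter. A production key would also need a policy for what counts as the requester identity.

```python
import hashlib
import re
import unicodedata

# Zero-width space, non-joiner, joiner, word joiner, and byte-order mark.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def clean_exemption_code(raw: str) -> str:
    """NFKC-fold, drop zero-width characters, then trim and lowercase.

    NFKC turns a non-breaking space into a plain space and full-width
    letters into ASCII; it does not rewrite en or em dashes.
    """
    text = unicodedata.normalize("NFKC", raw).translate(_ZERO_WIDTH)
    return text.strip().lower()


def _normalize_text(value: str) -> str:
    """Fold compatibility forms, collapse whitespace runs, and ignore case."""
    text = unicodedata.normalize("NFKC", value).translate(_ZERO_WIDTH)
    return re.sub(r"\s+", " ", text).strip().casefold()


def dedup_key(requester_identity: str, body: str) -> str:
    """SHA-256 hex digest over the normalized requester identity and request body."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    material = _normalize_text(requester_identity) + "\x1f" + _normalize_text(body)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because both functions are pure, the same email and portal submission produce the same key regardless of arrival order, which lets the reconciliation step run before a clock is assigned.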

Compliance Verification Checklist

Related

Core Architecture & Compliance Mapping — the parent architecture this classification layer feeds
State Law Compliance Frameworks — converts the classified request into jurisdiction-specific deadlines and exemptions
Records Retention Scheduling — uses record class and domain to compute lawful disposition dates and holds
Security Boundary Configuration — enforces least-privilege separation at the classification hop
How to map state-specific FOIA exemptions to Python dictionaries — the exemption-code mapping detail behind layer four
Department Routing Logic — the routing stage that consumes the validated taxonomy payload
