Optimizing Batch OCR Processing for Large Municipal Archives

Within Async Batch Processing, the hardest workload a records team faces is the bulk OCR pass over a municipal archive: tens of thousands of multipage TIFFs, compressed PDFs, council minutes, permit applications, and degraded microfilm that all have to become searchable, audit-ready text without losing a single responsive page. This page covers how to size and harden that pass so it produces deterministic output, survives a corrupt scan in the middle of a 10,000-page batch, and never quietly drops a record that a requester is legally entitled to receive.

Scenario & Compliance Stakes

A county clerk receives a request for “all building-permit correspondence for parcel 0451-22 from 1998 to 2014.” The responsive material is sixteen years of scanned paper sitting in a legacy content store, never indexed. The only way to search it is to OCR it first, and the response clock is already running while the batch grinds: 20 business days under 5 U.S.C. § 552(a)(6)(A)(i) for federal agencies, with state public-records statutes setting the deadline for county and municipal offices.

Naive sequential OCR fails this scenario in three ways that are each a compliance problem, not just a performance one. Blocking I/O lets latency eat the statutory window. Unbounded memory growth triggers OOM kills that silently abandon whatever pages were in flight, so the agency cannot prove it searched the whole archive. And low-confidence output from degraded microfilm, if published unreviewed, can cause a responsive page to be mis-redacted or a withheld page to leak. A defensible batch OCR pass has to be auditable end to end: every page accounted for, every skipped or low-confidence page logged with its reason, and the original scans left forensically intact for the rest of the Document Retrieval & Parsing pipeline.

Prerequisites

Python 3.11+, which provides exception groups, tomllib for config loading, and hashlib.file_digest for streamed hashing.
Tesseract 5.x with the language packs your jurisdiction needs (eng, plus osd for orientation detection on rotated microfilm).
pytesseract 0.3.10+, pdf2image 1.17+, PyMuPDF (fitz) 1.24+, pikepdf 9.x for linearization, and psutil 5.9+ for worker memory introspection.
A distributed task queue — Celery with Redis/RabbitMQ, or RQ — already wired into your Async Batch Processing workers.
A verified staging copy of each source file produced by your Repository Sync Protocols step. OCR must run against the staged copy, never the archival original.
Write access to an append-only audit store (WORM bucket or hash-chained log) and a dead-letter queue for failed pages.
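
For the hash-chained option in the last prerequisite, the idea is that each audit record carries the hash of the record before it, so any later edit breaks the chain. The following is a minimal sketch, not a production audit store: the names `append_record`, `verify_chain`, and `GENESIS` are hypothetical, and the in-memory list stands in for whatever durable storage your agency uses.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumption: placeholder "previous hash" for the first record


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same record always hashes the same.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: list[dict], record: dict) -> dict:
    """Append a record whose hash covers both its content and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "record": record, "hash": _digest(prev_hash, record)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash in order; any edited or reordered entry fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A WORM bucket gives the same tamper-evidence at the storage layer; the chain is only worth building when the store itself allows overwrites.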

Implementation

The pass is built around three guarantees: validate before you rasterize, chunk at the page level so memory stays flat, and gate every result on a confidence threshold before it is treated as responsive text. Start with a strict ingestion gateway so corrupt or oversized inputs are rejected at the boundary rather than killing a worker mid-batch.

python

import hashlib
import logging
from pathlib import Path

logger = logging.getLogger("municipal.ocr.gateway")

ALLOWED_MIME = {"application/pdf", "image/tiff", "image/jpeg"}
MAX_BYTES = 512 * 1024 * 1024  # 512 MB on-disk ceiling; rejects oversized inputs before any rasterizing

def validate_and_stage(file_path: Path, mime: str, expected_hash: str | None = None) -> str:
    """Reject malformed inputs at the gateway and return the verified SHA-256.

    Runs on the staged copy only; the archival original is never opened here, so
    chain of custody for the source artifact is preserved (NARA electronic-records
    guidance). Rejecting bad inputs at the boundary also keeps a mid-batch worker
    crash from eating into the statutory response deadline.
    """
    if mime not in ALLOWED_MIME:                       # 1. allowlist, not blocklist
        raise ValueError(f"mime_rejected:{mime}")
    if not file_path.exists() or file_path.stat().st_size == 0:
        raise ValueError("zero_byte_or_missing")       # 2. truncated export from legacy store
    if file_path.stat().st_size > MAX_BYTES:
        raise ValueError("oversize_rejected")          # 3. bound memory before rasterizing

    with file_path.open("rb") as fh:                   # stream the hash; never load 512 MB into RAM
        actual = hashlib.file_digest(fh, "sha256").hexdigest()
    if expected_hash and actual != expected_hash:      # 4. checksum from sync step must match
        raise ValueError(f"checksum_mismatch:{actual}")
    logger.info('{"event":"staged","file":"%s","sha256":"%s"}', file_path.name, actual)
    return actual

The core worker rasterizes one page at a time, scores Tesseract’s confidence, and emits a single structured JSON audit line per page. Page-level chunking is what keeps resident memory flat across a 10,000-page batch: the most common cause of OOM kills is converting a whole document in one call, which holds every rasterized page in memory as a PIL image at once. Pinning thread_count=1 (pdf2image’s default) keeps each call to a single converter process.

python

import gc
import json
import logging
import time
import psutil
import pytesseract
from pdf2image import convert_from_path
from pathlib import Path

logger = logging.getLogger("municipal.ocr.worker")

CONFIDENCE_FLOOR = 60.0          # below this, route to human review, never publish
RSS_SOFT_LIMIT = 0.85           # fraction of cgroup MemoryMax before graceful shutdown

def _mean_confidence(data: dict) -> float:
    # conf values may arrive as int, float, or str; -1 marks non-text rows and is excluded
    scores = [s for s in (float(c) for c in data["conf"]) if s >= 0]
    return sum(scores) / len(scores) if scores else 0.0

def ocr_page(staged_pdf: Path, page_no: int, correlation_id: str, doc_guid: str,
             cgroup_max_bytes: int, lang: str = "eng") -> dict:
    """OCR a single page and return an audit-ready result envelope.

    correlation_id ties the output back to doc_guid + ingestion timestamp so a
    compliance officer can prove no page was skipped, altered, or misattributed.
    """
    started = time.monotonic()
    # 1. Rasterize exactly one page at 300 DPI so only one PIL image is ever held.
    images = convert_from_path(staged_pdf, dpi=300, first_page=page_no,
                               last_page=page_no, thread_count=1)
    try:
        # 2. Structured output gives per-word confidence, not just text.
        data = pytesseract.image_to_data(
            images[0], lang=lang, output_type=pytesseract.Output.DICT,
            config="--oem 3 --psm 6 -c preserve_interword_spaces=1",
        )
        text = " ".join(w for w in data["text"] if w.strip())
        conf = _mean_confidence(data)
        disposition = "PUBLISH" if conf >= CONFIDENCE_FLOOR else "REVIEW_QUEUE"
    finally:
        # 3. Release the image buffer deterministically before the next page.
        for img in images:
            img.close()
        del images
        gc.collect()

    # 4. Circuit breaker: if this worker is near its cgroup cap, signal a recycle.
    rss = psutil.Process().memory_info().rss
    near_oom = bool(cgroup_max_bytes) and (rss / cgroup_max_bytes) >= RSS_SOFT_LIMIT

    envelope = {
        "event": "page_ocr",
        "correlation_id": correlation_id,
        "doc_guid": doc_guid,
        "page": page_no,
        "mean_confidence": round(conf, 2),
        "disposition": disposition,
        "chars": len(text),
        "rss_bytes": rss,
        "recycle_worker": near_oom,
        "elapsed_ms": round((time.monotonic() - started) * 1000, 1),
    }
    logger.info(json.dumps(envelope))   # 5. append-only JSON line per page = the audit trail
    return {**envelope, "text": text}

Wire this into the queue with strict guardrails rather than relying on defaults. On Celery, run CPU-bound Tesseract under the prefork pool with --concurrency tied to vCPU count and --max-tasks-per-child=50 so each worker is recycled before heap fragmentation accumulates. Cap concurrent fetches from distributed storage nodes with an asyncio.Semaphore, and configure systemd cgroups (MemoryMax=4G) so a runaway page is contained by the OS, not just the application. Tag every page with the disposition codes and provenance your Metadata Extraction Techniques layer expects, and reuse the layout-tuning work from Tuning Tesseract OCR for Government Form Layouts for the harder form-heavy documents in the archive.
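
The fetch cap mentioned above can be sketched with nothing but the standard library. This is an illustration under assumptions, not the pipeline's actual fetch layer: `fetch_all`, `fetch_one`, and `MAX_CONCURRENT_FETCHES` are hypothetical names, and `fetch_one` stands in for whatever coroutine pulls a staged file from your storage nodes.

```python
import asyncio

MAX_CONCURRENT_FETCHES = 4  # assumption: tune to what the storage nodes tolerate


async def fetch_all(keys, fetch_one, limit=MAX_CONCURRENT_FETCHES):
    """Fetch every key, never running more than `limit` fetches at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(key):
        async with sem:  # waits here once `limit` fetches are already in flight
            return await fetch_one(key)

    # gather preserves input order, so results line up with keys
    return await asyncio.gather(*(bounded(k) for k in keys))
```

The semaphore bounds only the I/O side; the CPU-bound OCR itself stays in the prefork workers, where concurrency is set by the pool size rather than by asyncio.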

Low-confidence and transient-failure handling is the part that keeps the batch defensible. A page that scores below the floor goes to a human review queue, not to publication; a transient storage timeout retries with jittered backoff up to three times, then lands in the dead-letter queue with full context.
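
The retry-then-dead-letter rule could look roughly like the sketch below. It assumes a hypothetical `TransientStorageError` raised by your storage client for retryable timeouts, a plain list standing in for the dead-letter queue, and "full jitter" backoff (a random delay between zero and an exponentially growing cap); your queue's built-in retry options may already cover this.

```python
import random
import time


class TransientStorageError(Exception):
    """Hypothetical marker for retryable failures such as storage timeouts."""


def run_with_retry(task, page_ref, dead_letter, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call task(page_ref); retry transient failures, then dead-letter with context.

    page_ref is a dict such as {"doc_guid": ..., "page": ..., "correlation_id": ...}.
    Returns the task result, or None if the page was dead-lettered.
    """
    for attempt in range(retries + 1):  # one initial try plus `retries` retries
        try:
            return task(page_ref)
        except TransientStorageError as exc:
            if attempt == retries:
                # Out of retries: record enough context to re-queue the page later.
                dead_letter.append({**page_ref, "attempts": attempt + 1, "error": repr(exc)})
                return None
            # Full jitter: random delay in [0, base_delay * 2**attempt].
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Only the transient error type is caught, so a genuinely corrupt page still raises immediately instead of burning three retries. A dead-lettered page emits no page_ocr event, so it also surfaces as a gap in page accounting rather than disappearing.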

Expected Output & Verification

Each page emits exactly one JSON audit line. A clean run over a degraded microfilm page that still cleared the floor looks like this:

json

{"event": "page_ocr", "correlation_id": "req-0451-22-0007", "doc_guid": "a91f...c2", "page": 7, "mean_confidence": 71.4, "disposition": "PUBLISH", "chars": 1843, "rss_bytes": 612843520, "recycle_worker": false, "elapsed_ms": 894.3}

Verify three invariants before trusting the batch. First, page accounting: the count of page_ocr events for a document must equal its page count — a gap proves a worker died mid-document and the doc must be re-queued. Second, flat memory: rss_bytes should plateau, not climb monotonically; a rising trend means the page-level cleanup is leaking and you will OOM on a long batch. Third, disposition coverage: every page is either PUBLISH or REVIEW_QUEUE — there is no third “silently dropped” state. A quick assertion in your reconciliation job enforces the first invariant:

python

def assert_complete(doc_guid: str, expected_pages: int, audit_lines: list[dict]) -> None:
    seen = {e["page"] for e in audit_lines if e["doc_guid"] == doc_guid}
    missing = set(range(1, expected_pages + 1)) - seen
    if missing:                                  # never close a FOIA item with gaps
        raise AssertionError(f"unaccounted_pages:{doc_guid}:{sorted(missing)}")
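
The second and third invariants can be checked in the same reconciliation job. The sketch below is one possible approach, not a prescribed one: `assert_dispositions` and `rss_is_climbing` are hypothetical helpers, and the quarter-over-quarter comparison with a 10% tolerance is an assumed heuristic for "plateau versus climb" that expects the audit lines of a single worker in emission order.

```python
VALID_DISPOSITIONS = {"PUBLISH", "REVIEW_QUEUE"}


def assert_dispositions(audit_lines: list[dict]) -> None:
    """Fail if any page carries a disposition other than the two allowed states."""
    bad = [(e["doc_guid"], e["page"]) for e in audit_lines
           if e.get("disposition") not in VALID_DISPOSITIONS]
    if bad:
        raise AssertionError(f"invalid_disposition:{bad}")


def rss_is_climbing(audit_lines: list[dict], tolerance: float = 0.10) -> bool:
    """Heuristic: is mean RSS over the last quarter of pages well above the first quarter?"""
    rss = [e["rss_bytes"] for e in audit_lines]
    if len(rss) < 8:          # too few samples to call a trend
        return False
    q = len(rss) // 4
    head = sum(rss[:q]) / q
    tail = sum(rss[-q:]) / q
    return tail > head * (1 + tolerance)
```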

Common Pitfalls

Whole-document rasterization on large batches. Calling convert_from_path without first_page and last_page returns every page of the document as a PIL image, so resident memory scales with page count until the cgroup kills the worker and abandons in-flight pages. Rasterize a single page per call, as shown above, and keep thread_count=1 (pdf2image’s default) so each call runs a single converter process.
Treating Tesseract’s -1 confidence as zero. Non-text regions report conf = -1. Averaging those in drags the mean down and floods the review queue with perfectly good pages; filter -1 out before computing the page mean, or your CONFIDENCE_FLOOR gate becomes meaningless.
Non-deterministic OCR across re-runs. Tesseract’s multithreaded LSTM can produce byte-different output on identical input, which breaks the hash-based reconciliation an audit depends on. Pin OMP_THREAD_LIMIT=1 (and MKL_NUM_THREADS=1) in the worker environment so a re-run of the same page yields the same text and the same checksum.
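
To make the -1 pitfall concrete, here is a small worked example with made-up confidence values (four non-text rows and four well-recognized words). `page_mean` is a hypothetical helper written for this illustration; the numbers are not from a real scan.

```python
def page_mean(conf_values, drop_sentinel=True):
    """Mean confidence; optionally drop Tesseract's -1 sentinel for non-text rows."""
    scores = [float(c) for c in conf_values]
    if drop_sentinel:
        scores = [s for s in scores if s >= 0]
    return sum(scores) / len(scores) if scores else 0.0


conf = [-1, -1, -1, -1, 92, 88, 95, 90]        # illustrative values only

filtered = page_mean(conf)                      # 365 / 4 = 91.25
unfiltered = page_mean(conf, drop_sentinel=False)  # (365 - 4) / 8 = 45.125
```

With a floor of 60, the filtered mean publishes the page while the unfiltered mean would send the same clean page to the review queue.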

Frequently Asked Questions

How do I size concurrency for a multi-thousand-page archive without OOM kills?

Tie --concurrency to the vCPU count for CPU-bound Tesseract work (one worker per core is a safe start), cap each worker’s RSS with a systemd cgroup MemoryMax, and recycle workers with --max-tasks-per-child=50 so heap fragmentation never compounds. The recycle_worker flag in the result envelope lets you trigger an early graceful shutdown when a worker crosses 85% of its cap: the worker finishes its current page and exits before picking up the next one, instead of being OOM-killed mid-page.
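
The worker needs the cap as a byte count to evaluate that 85% threshold. One way to obtain it, assuming a Linux host on the cgroup v2 unified hierarchy, is to read the process's own memory.max file. The helper names below are hypothetical, and the function deliberately returns 0 when no limit can be determined, which the worker's check treats as "no cap known".

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # cgroup v2 unified hierarchy mount point


def parse_memory_max(raw: str) -> int:
    """Return the limit in bytes, or 0 when the cgroup is unlimited ("max")."""
    value = raw.strip()
    return 0 if value == "max" else int(value)


def current_cgroup_max_bytes() -> int:
    """Best-effort lookup of this process's memory.max; 0 if unavailable."""
    try:
        # On cgroup v2, /proc/self/cgroup holds a single "0::<path>" entry.
        for line in Path("/proc/self/cgroup").read_text().splitlines():
            if line.startswith("0::"):
                rel = line[3:].lstrip("/")
                return parse_memory_max((CGROUP_ROOT / rel / "memory.max").read_text())
    except (OSError, ValueError):
        pass  # not Linux, cgroup v1, or no readable limit: fall through to 0
    return 0
```

With MemoryMax=4G set on the unit, memory.max reads 4294967296, so the recycle flag would trip at roughly 3.4 GiB of RSS.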

What confidence threshold should gate publication for FOIA-responsive scans?

A mean page confidence of 60% is a defensible default floor for degraded municipal scans, but the number matters less than the rule: anything below the floor is held for human review, never published or auto-redacted on the strength of mis-recognized text. Log the score and the REVIEW_QUEUE disposition for every held page so a compliance officer can show the agency neither released nor withheld a record based on unreviewed OCR output.

How do I prove no page was skipped when the batch crashed halfway?

Reconcile the count of page_ocr audit events against each document’s known page count before closing the request. Because every page emits exactly one append-only JSON line keyed by doc_guid and page, a missing page number is provable evidence that a worker died mid-document — re-queue that document and re-run the assertion. The absence of a “silently dropped” disposition is what makes the search defensible under judicial review.

Related

Async Batch Processing: the parent system this OCR pass plugs into
Tuning Tesseract OCR for Government Form Layouts
Extracting Metadata from Scanned Municipal Records with OpenCV
Managing High-Volume Intake with Celery Task Queues
