Managing High-Volume Intake with Celery Task Queues

Within Async Queue Management, the failure mode that costs agencies their statutory deadlines is the intake surge: a synchronous HTTP intake path that works fine at baseline collapses the moment volume jumps 300-800% after an investigative story, a legislative action, or an emergency declaration. This page covers how to decouple ingestion from processing with a Celery task queue so that a spike fills a durable broker instead of timing out the request, every submission is processed exactly once, and no record is silently dropped while the 20-business-day clock is running.

Scenario & Compliance Stakes

A transparency portal that normally takes a few dozen public-records requests a day suddenly receives several thousand in an afternoon after a contract scandal breaks. Under a synchronous architecture, each POST tries to parse, classify, and route inline; worker threads block on OCR calls and database locks, the connection pool saturates, and the load balancer starts returning 502s. Requesters retry, doubling the load, and some submissions are accepted with a 200 but never persisted because the worker died mid-handler.

Every one of those dropped payloads is a compliance problem, not just an availability one. The 20-business-day response window under 5 U.S.C. § 552(a)(6)(A)(i) starts when a request is received, regardless of whether your system managed to store it — so a request that vanished in a timeout cascade is a missed statutory deadline with no record that it ever arrived. A defensible intake path has to accept the submission durably the instant it lands, prove receipt with an immutable timestamp, and process it asynchronously through the rest of the Intake & Routing Workflows pipeline. Celery with a durable broker gives you exactly that boundary: the HTTP layer does nothing but validate and enqueue, and the broker holds the backlog so a spike degrades latency, never durability.

Prerequisites

Python 3.11+ for the structured exception handling and decimal-clean arithmetic used below.
Celery 5.3+ with kombu 5.3+ for the priority-queue and dead-letter-exchange declarations.
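RabbitMQ 3.12+ as the broker, preferred over Redis for this workload because it supports x-max-priority queues, persistent (delivery_mode=2) messages, dead-letter exchanges, and quorum queues for replicated failover. In RabbitMQ 3.x, x-max-priority is a classic-queue feature that quorum queues do not support, so decide per queue whether priority ordering or replication matters more.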
Redis 7.x as the result backend and idempotency cache (a fast dedup store, not the broker).
Write access to an append-only audit store — a WORM bucket or hash-chained PostgreSQL table — for the receipt and disposition log.
Validation already enforced upstream by your Security Boundary Configuration, so the enqueue path trusts only allowlisted payload shapes.

Implementation

The architecture rests on three guarantees: a durable, priority-aware broker so a surge queues instead of failing; late-acknowledged, idempotent tasks so a worker crash re-queues work instead of losing or duplicating it; and a structured audit line at every state transition. Start with the broker and concurrency configuration. Priority queues require x-max-priority declared per queue, and worker_prefetch_multiplier = 1 is what stops a single worker from greedily reserving the whole backlog during parsing-heavy tasks.

python

# celery_config.py
import os
from kombu import Exchange, Queue

broker_url = os.getenv("CELERY_BROKER_URL", "amqp://foia_broker:secure_pass@mq-cluster:5672/foia_intake")
result_backend = os.getenv("CELERY_RESULT_BACKEND", "redis://redis-cluster:6379/0")

# 1. Strict concurrency: one reserved task per worker process so a surge spreads
#    evenly, plus recycling limits so OCR/PDF extraction cannot grow a child unbounded.
worker_prefetch_multiplier = 1
worker_max_tasks_per_child = 500            # recycle before heap fragmentation compounds
worker_max_memory_per_child = 1024 * 1024   # 1 GiB, given in KiB; the child is replaced
                                            # after the task that crosses the limit finishes

# 2. Priority queues. RabbitMQ needs x-max-priority declared per queue, not globally.
intake_exchange = Exchange("intake_exchange", type="direct", durable=True)
task_queues = (
    Queue("intake_default", intake_exchange, routing_key="default", durable=True),
    Queue("intake_high", intake_exchange, routing_key="high",
          queue_arguments={"x-max-priority": 10}, durable=True),
    Queue("intake_critical", intake_exchange, routing_key="critical",
          queue_arguments={"x-max-priority": 10}, durable=True),
)
task_default_queue = "intake_default"
task_queue_max_priority = 10

# 3. Durability: persist messages so a broker restart mid-surge loses nothing
#    (the receipt is the proof the 20-business-day clock started — 5 U.S.C. 552(a)(6)(A)(i)).
task_acks_late = True
task_reject_on_worker_lost = True
broker_transport_options = {"confirm_publish": True}

The task itself must be idempotent so that broker redelivery or a client retry never produces a duplicate record. Derive an idempotency key from request metadata (a SHA-256 of sender, timestamp, and payload hash), guard on it, and only set the dedup marker after the database transaction commits. acks_late=True plus reject_on_worker_lost=True means an interrupted task is re-queued rather than silently acknowledged and lost.

python

# tasks/intake.py
import json
import logging
import random
from celery import shared_task
from django.core.cache import cache
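from django.db import transaction

# create_intake_record() is the application's own persistence helper (not shown);
# import it from wherever your project defines it.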

logger = logging.getLogger("foia.intake.queue")

DEDUP_WINDOW_S = 86400  # 24h: covers client retries without blocking legitimate resubmissions

@shared_task(bind=True, max_retries=3, acks_late=True, reject_on_worker_lost=True,
             rate_limit="100/m")
def process_intake_submission(self, payload: dict, idempotency_key: str,
                              correlation_id: str) -> dict:
    """Persist and route one public-records submission exactly once.

    correlation_id ties the receipt timestamp to the final disposition so a
    compliance officer can prove no request was dropped or double-processed.
    """
    # 1. Idempotency guard: a redelivered or client-retried submission is a no-op.
    if cache.get(idempotency_key):
        logger.info(json.dumps({"event": "duplicate_blocked",
                                "correlation_id": correlation_id, "key": idempotency_key}))
        return {"status": "duplicate", "key": idempotency_key}
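    try:
        # 2. Persist, then set the dedup marker only once the transaction commits.
        #    A rollback must not leave a marker that blocks the retry as a duplicate.
        with transaction.atomic():
            record = create_intake_record(payload)             # writes receipt timestamp
            transaction.on_commit(
                lambda: cache.set(idempotency_key, "processed", timeout=DEDUP_WINDOW_S))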
        logger.info(json.dumps({"event": "intake_persisted", "correlation_id": correlation_id,
                                "record_id": record.id, "queue": self.request.delivery_info}))
        # 3. Hand off to scoring/routing only after durable persistence.
        return {"status": "queued_for_processing", "record_id": record.id}
    except Exception as exc:
        # 4. Exponential backoff plus up to 20% random jitter prevents a thundering-herd retry storm.
        delay = min(30 * (2 ** self.request.retries), 3600)
        delay += random.uniform(0, delay * 0.2)
        logger.error(json.dumps({"event": "intake_retry", "correlation_id": correlation_id,
                                 "attempt": self.request.retries, "error": str(exc)}))
        raise self.retry(exc=exc, countdown=int(delay))
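
The idempotency_key passed to the task could be derived along the following lines. This is a minimal sketch, not the pipeline's actual helper: the function name derive_idempotency_key, the "intake:" prefix, and the choice to lowercase the sender are assumptions made for illustration.

```python
import hashlib
import json


def derive_idempotency_key(sender: str, submitted_at: str, payload: dict) -> str:
    """Return a stable SHA-256 key for one submission (hypothetical helper)."""
    # Canonical JSON: sorted keys and fixed separators, so dict ordering
    # never changes the digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    payload_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Normalize the sender so trivial case or whitespace differences still match.
    material = "|".join([sender.strip().lower(), submitted_at, payload_hash])
    return "intake:" + hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because the digest comes from hashlib rather than the built-in hash(), the same submission yields the same key in every worker process and across restarts.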

With persistence guaranteed, the enqueued record flows into the rest of the pipeline. The ingestion task delegates heterogeneous payloads — multipart MIME, base64 web forms, bulk CSV — to the sanitizers in Email & Form Parsing Pipelines, feeds the validated result into Priority Scoring Algorithms so statutory deadlines and requester class map onto the x-max-priority levels, and routes the scored payload to jurisdiction queues via Department Routing Logic. A high-priority submission bypasses the default queue with process_intake_submission.apply_async(queue="intake_critical", priority=9), and scanned attachments are passed to OCR Processing Pipelines as a separate chained task rather than being processed inline.
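
One way to map a score onto the queues and priority levels is sketched below. The 0-100 score scale, the thresholds of 90 and 60, and the helper name routing_options are all assumptions for illustration; the real values would come from the priority-scoring step.

```python
def routing_options(score: int) -> dict:
    """Map a 0-100 priority score to apply_async keyword arguments.

    The score scale and thresholds are illustrative assumptions.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 90:
        queue = "intake_critical"
    elif score >= 60:
        queue = "intake_high"
    else:
        queue = "intake_default"
    # Scale the score onto message priorities 0-9 (the queues declare
    # x-max-priority = 10, and the article publishes critical work at 9).
    priority = min(9, score // 10)
    return {"queue": queue, "priority": priority}
```

The result can be splatted into the publish call, for example process_intake_submission.apply_async(args=[payload, idempotency_key, correlation_id], **routing_options(score)).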

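Permanently failed tasks (retries exhausted) must land in a dead-letter queue for manual compliance review, never be discarded. Configure the DLX on each queue (x-dead-letter-exchange) and have a reviewer task drain it, so a malformed or repeatedly failing submission is preserved with its full context and chain of custody: submission timestamp, queue assignment, retry count, final disposition, and reviewer ID. Exhausting max_retries is not enough on its own, because RabbitMQ only dead-letters a message that is rejected without requeue, expires, or overflows the queue; the task has to reject the message explicitly, for example by raising celery.exceptions.Reject(exc, requeue=False) under acks_late.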
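
The queue arguments and the custody record described above might be shaped as follows. This is a sketch under stated assumptions: the x-dead-letter-* keys are RabbitMQ's own argument names, but the exchange name intake_dlx, the routing key dead, the CustodyRecord class, and the intake_dead_lettered event name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


def dead_letter_arguments(dlx_name: str = "intake_dlx",
                          routing_key: str = "dead") -> dict:
    """Queue arguments that send rejected messages to a dead-letter exchange.

    The exchange and routing-key names are placeholders for this sketch.
    """
    return {
        "x-dead-letter-exchange": dlx_name,
        "x-dead-letter-routing-key": routing_key,
    }


@dataclass(frozen=True)
class CustodyRecord:
    """Chain-of-custody fields preserved with a dead-lettered submission."""
    correlation_id: str
    submitted_at: str                   # receipt timestamp, ISO 8601
    queue: str
    retry_count: int
    disposition: str                    # e.g. "dead_lettered"
    reviewer_id: Optional[str] = None   # set when a reviewer drains the queue

    def to_audit_line(self) -> str:
        # One JSON line per record, with sorted keys for stable diffs.
        return json.dumps({"event": "intake_dead_lettered", **asdict(self)},
                          sort_keys=True)
```

The arguments would be merged into each queue declaration, for example queue_arguments={"x-max-priority": 10, **dead_letter_arguments()}.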

The idempotent lifecycle (receive, dedup-check, transactional persist, late-ack, retry-with-backoff, or dead-letter) is the contract that keeps the queue defensible under a surge.

Expected Output & Verification

Each state transition emits one JSON audit line. A clean persistence of a critical-priority submission looks like this:

json

{"event": "intake_persisted", "correlation_id": "req-2024-08-14-3391", "record_id": 88412, "queue": {"exchange": "intake_exchange", "routing_key": "critical", "priority": 9}}

Verify three invariants before trusting the queue under load. First, receipt completeness: every accepted HTTP request has a matching intake_persisted or duplicate_blocked line — there is no third “accepted but lost” state, which is the whole point of acks_late. Second, backlog drains, never stalls: on the RabbitMQ side, watch messages_ready (waiting) against messages_unacknowledged (in-flight); a sustained high unacknowledged count means workers are stalling on blocking I/O, and the fix is more workers or async handoff, not a bigger broker. Third, no duplicate records: a redelivered message must produce a duplicate_blocked line, not a second row.

bash

# Inspect in-flight tasks per worker (use `inspect stats` for per-worker resource usage)
celery -A proj inspect active --json | jq '.[][] | {hostname, name, time_start}'

# Backlog vs in-flight per queue — the surge-health signal
rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers
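
The receipt-completeness invariant can also be checked offline against the audit log. The sketch below assumes the HTTP layer can supply the set of correlation IDs it accepted; the function name find_unaccounted is hypothetical, while the two terminal event names are the ones the task emits.

```python
import json
from typing import Iterable

TERMINAL_EVENTS = {"intake_persisted", "duplicate_blocked"}


def find_unaccounted(accepted_ids: Iterable[str],
                     audit_lines: Iterable[str]) -> set:
    """Return accepted correlation IDs that have no terminal audit line."""
    settled = set()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log file
        entry = json.loads(line)
        if entry.get("event") in TERMINAL_EVENTS:
            settled.add(entry.get("correlation_id"))
    return set(accepted_ids) - settled
```

Any ID this returns is either still in flight (retrying or queued) or needs investigation, which is the set a reviewer should look at first.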

A reconciliation job should also cross-reference each receipt timestamp against its statutory deadline and flag any record approaching the 20-business-day FOIA threshold (or the shorter state equivalent) while it is still in the queue.
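
The deadline arithmetic for such a reconciliation job could look like this. It is a sketch that assumes counting starts the day after receipt and skips weekends plus a caller-supplied holiday set; both conventions, and the five-day warning window, are assumptions to confirm against the governing statute.

```python
from datetime import date, timedelta


def statutory_deadline(received: date, business_days: int = 20,
                       holidays: frozenset = frozenset()) -> date:
    """Return the date `business_days` working days after `received`."""
    current = received
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        # Monday-Friday are weekday() 0-4; skip weekends and listed holidays.
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


def is_approaching_deadline(received: date, today: date,
                            warn_within_days: int = 5,
                            holidays: frozenset = frozenset()) -> bool:
    """True when the deadline is within `warn_within_days` calendar days."""
    deadline = statutory_deadline(received, holidays=holidays)
    return (deadline - today).days <= warn_within_days
```

For a request received on Wednesday 2024-08-14, this yields 2024-09-11 with no holidays, or 2024-09-12 once Labor Day (2024-09-02) is in the holiday set.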

Common Pitfalls

Using hash() or wall-clock-only idempotency keys. Python’s built-in hash() is salted per process, so the same payload yields different keys across worker restarts and the dedup guard silently fails. Derive the key from a stable hashlib.sha256 of normalized metadata, and include a payload hash so two genuinely different requests from the same sender in the same second are not collapsed into one.
Early acknowledgment (the default, task_acks_late = False) on a long task. With early acknowledgment the worker acknowledges the message just before it starts executing the task; if that worker is OOM-killed mid-OCR, the submission is gone with no trace and the statutory clock keeps running. Always set acks_late=True and reject_on_worker_lost=True so an interrupted task is re-queued.
Retrying without jitter during a surge. A fixed or purely exponential backoff makes every failed task from a downstream 503 retry in lockstep, hammering the recovering service in synchronized waves. Add random jitter (for example random.uniform(0, delay * 0.2) on top of the exponential delay) so retries spread out, and cap attempts at three before dead-lettering; never retry indefinitely.
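
The two jitter styles can be compared side by side. This is an illustrative sketch: backoff_delay is a hypothetical helper, with the 30-second base and 3600-second cap taken from the task above. Full jitter draws from the whole interval up to the exponential ceiling, while the additive variant is the 20% form used in the task.

```python
import random
from typing import Optional


def backoff_delay(retries: int, base: float = 30.0, cap: float = 3600.0,
                  full_jitter: bool = True,
                  rng: Optional[random.Random] = None) -> float:
    """Return a retry delay in seconds for the given attempt number."""
    source = rng or random  # fall back to the module-level generator
    ceiling = min(base * (2 ** retries), cap)
    if full_jitter:
        # Full jitter: pick uniformly from the whole interval [0, ceiling].
        return source.uniform(0, ceiling)
    # Additive jitter: the exponential delay plus up to 20% on top.
    return ceiling + source.uniform(0, ceiling * 0.2)
```

Full jitter spreads retries more widely but allows very short delays; the additive form keeps a guaranteed minimum wait at the cost of tighter clustering.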

Frequently Asked Questions

Why RabbitMQ over Redis as the broker for high-volume FOIA intake?

Redis is an excellent result backend and idempotency cache, but as a Celery broker it lacks native per-queue priority (x-max-priority), durable dead-letter exchanges, and publisher confirms — the three features that make a surge defensible. RabbitMQ persists messages to disk with delivery_mode=2, routes exhausted tasks to a dead-letter queue for compliance review, and with quorum queues guarantees zero data loss during broker failover across nodes. For a workload where a lost message is a missed statutory deadline, those durability guarantees outweigh Redis’s lower latency.

How do I keep urgent requests moving when the default queue is backed up by thousands of routine ones?

Separate the work into distinct queues (intake_critical, intake_high, intake_default) with dedicated workers, rather than relying on priority within a single queue alone. Critical-tier submissions, as classified by your priority-scoring step, are published with apply_async(queue="intake_critical", priority=9) and consumed by workers that never touch the default queue, so a multi-thousand-message routine backlog can never starve a time-sensitive statutory request. Keep worker_prefetch_multiplier = 1 so a worker reserves only one task at a time and cannot hoard the backlog.

What belongs in the audit log to satisfy a chain-of-custody review?

Emit one append-only JSON line at every state transition: receipt (with the timestamp that starts the statutory clock), queue and priority assignment, each retry with its attempt number and error, and the final disposition — persisted, duplicate, or dead-lettered with the reviewer ID. Keying every line on a correlation_id lets a compliance officer reconstruct the full path of any single request and prove it was neither dropped nor double-processed. Retain these logs for the litigation-hold period your jurisdiction requires.
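
If the audit store is a hash-chained table rather than a WORM bucket, the chaining could be sketched as follows. The entry layout (prev, event, hash) and the helper names are assumptions for illustration, shown here against an in-memory list rather than PostgreSQL.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes identically.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> dict:
    """Append an audit event, linking it to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "event": event,
             "hash": _digest(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash covers the previous one, altering or deleting any earlier line invalidates every entry after it, which is what makes the log tamper-evident.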

Related

Async Queue Management: the parent system this intake path plugs into
Implementing Dynamic Priority Scoring for Urgent Municipal Requests
Parsing Multi-Format FOIA Submissions with Python Regex
Routing Requests to Correct Departments Using Departmental Mapping Tables
Optimizing Batch OCR Processing for Large Municipal Archives
