Syncing Legacy Document Management Systems with Modern REST APIs

Within Repository Sync Protocols, the hardest single integration most public-sector teams own is wiring a decades-old document management system — a proprietary DMS, a mainframe-adjacent metadata store, or a hierarchical file vault — to a contemporary JSON/REST endpoint. This page covers how to build that one-directional sync so it produces deterministic, replayable output, never duplicates or skips a responsive record, and writes a defensible audit trail for every document that crosses the boundary.

Scenario & Compliance Stakes

A records office is migrating active FOIA fulfillment off a 1990s-era DMS that has no ETags, no cursor pagination, and no idempotency tokens — only composite alphanumeric record keys and a Last-Modified column that the vendor populates inconsistently. The modern REST target expects clean UTF-8 JSON, content-addressed deduplication, and TLS 1.3. The sync layer sits between them, and from the moment a request is logged the agency is on the clock: the federal Freedom of Information Act sets a 20-business-day window on the substantive response (5 U.S.C. § 552(a)(6)(A)(i)), and state analogues are often shorter.

The failure modes here are compliance failures, not merely performance ones. Drop a record during a concurrent legacy update and the production is incomplete. Ingest the same record twice because a retry fired and the disclosure is double-counted. Lose the link between a file and the hash observed at extraction and the redaction becomes indefensible in litigation. Because the legacy side offers no native change feed, every one of those guarantees has to be reconstructed at the application layer — deterministic identity, content-keyed idempotency, and an append-only record of each transfer of custody under NIST SP 800-53 AU-2 and AU-12.

Prerequisites

Python 3.11+ for tomllib config loading and exception groups.
aiohttp 3.9+ for streaming, connection-pooled async transport. tenacity 8.x can supply the retry and backoff primitives, but it does not ship a circuit breaker; the listings on this page hand-roll both.
Read-only credentials scoped to exactly one legacy repository (least privilege) and a write credential for the REST target that cannot mutate the audit store.
A documented legacy schema: which columns are immutable (doc_id, created_dt, agency_code), which carry the change signal (Last-Modified, a monotonic sequence counter, or audit-log rows), and the source character encoding.
An append-only audit store (WORM bucket or hash-chained log, tamper-evident under AU-9) plus a dead-letter queue for records that exhaust retries.
A normalization step already in place — your Document Retrieval & Parsing pipeline should expose Metadata Extraction Techniques for field flattening and OCR Processing Pipelines for image-only records before payloads reach this sync.
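
The hash-chained log named in the prerequisites can be illustrated with a small in-memory sketch. It is not the audit store itself: the `GENESIS` sentinel, the entry layout, and the function names `chain_append` and `chain_verify` are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel used as prev_hash for the first entry


def _link(prev_hash: str, event: dict) -> str:
    # Canonical serialization so the same event always hashes the same way.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def chain_append(log: list[dict], event: dict) -> dict:
    """Append an event, binding it to the hash of the previous entry."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    entry = {"prev_hash": prev_hash, "event": event,
             "entry_hash": _link(prev_hash, event)}
    log.append(entry)
    return entry


def chain_verify(log: list[dict]) -> bool:
    """Recompute every link. An edited, reordered, or removed interior entry
    breaks the chain; tail truncation is NOT detected unless the latest
    entry_hash is also stored somewhere outside the log."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _link(prev_hash, entry["event"]):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

A WORM bucket gives the same tamper evidence through storage policy instead; the chain is mainly useful when the log lives on ordinary storage.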

Implementation

The sync runs in two phases so it never pulls a payload it does not need. Phase one performs a lightweight metadata diff using whatever change signal the legacy side exposes; phase two fetches and transmits only the records whose content hash has actually diverged. Identity is reconstructed deterministically: legacy primary keys map to a UUIDv5 derived from immutable attributes, so the same record always resolves to the same canonical ID across every run and worker.

Identity and content fingerprinting come first, because every later guarantee — dedup, audit, replay — keys off them. The canonical ID must be derived only from fields the legacy system can never change after creation, or the same document will resync as a “new” record.

python

import hashlib
import json
import logging
import uuid
from datetime import datetime, timezone
from typing import Any

logger = logging.getLogger("records.sync")

# UUIDv5 over IMMUTABLE fields only: a stable identity survives every resync so the
# REST target can dedupe (NIST SP 800-53 AU-12 requires attributable audit content).
def canonical_id(legacy: dict[str, Any]) -> str:
    try:
        seed = f"{legacy['doc_id']}|{legacy['created_dt']}|{legacy['agency_code']}"
    except KeyError as exc:                       # 1. fail loud: a missing immutable field
        logger.error('{"event":"id_field_missing","field":"%s"}', exc.args[0])
        raise                                     #    must never silently produce a random ID
    return str(uuid.uuid5(uuid.NAMESPACE_OID, seed))


def payload_hash(body: bytes) -> str:
    # 2. SHA-256 is the delta signal AND the chain-of-custody fingerprint for litigation.
    return hashlib.sha256(body).hexdigest()


def audit(event: str, **fields: Any) -> None:
    # 3. One append-only JSON line per decision = the AU-2 audit record.
    logger.info(json.dumps({
        "event": event,
        "ts": datetime.now(timezone.utc).isoformat(),
        **fields,
    }))

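Phase two transmits the changed records over a connection-pooled async client with idempotency enforced on the wire. Each POST carries an X-Request-ID set to the canonical ID so the target rejects a duplicate within its dedup window and returns 409 Conflict instead of re-ingesting. Transient 5xx responses and connection errors retry with jittered exponential backoff, and 4xx client errors fail fast to the dead-letter queue rather than poisoning the retry path. The listing below takes the body as in-memory bytes for brevity; production transfers should stream it so memory stays flat regardless of how large an archival record is. Note that the key identifies the record, not a version of it: a revision sent inside the dedup window would also draw a 409 unless the target also compares X-Payload-SHA256.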

python

import asyncio
import random
import aiohttp

MAX_RETRIES = 4
BREAKER_TRIP = 5          # consecutive 5xx before the endpoint is treated as down

async def sync_record(session: aiohttp.ClientSession, api: str,
                      legacy: dict[str, Any], body: bytes,
                      breaker: dict[str, int], sem: asyncio.Semaphore) -> str:
    cid = canonical_id(legacy)
    digest = payload_hash(body)
    headers = {
        "X-Request-ID": cid,                      # 1. identity-stable idempotency key
        "Content-Type": "application/json",
        "X-Payload-SHA256": digest,               # 2. target verifies integrity in transit
    }
    for attempt in range(1, MAX_RETRIES + 1):
        if breaker["fails"] >= BREAKER_TRIP:      # 3. circuit open: do not hammer a sick API
            audit("breaker_open", canonical_id=cid, attempt=attempt)
            raise RuntimeError("circuit_open")
        try:
            async with sem:                       # 4. bound concurrency -> bound memory/FDs
                async with session.post(
                    api, data=body, headers=headers,
                    timeout=aiohttp.ClientTimeout(total=30),
                ) as resp:
                    if resp.status in (200, 201, 409):   # 409 == already ingested: success
                        breaker["fails"] = 0
                        audit("synced", canonical_id=cid, sha256=digest,
                              status=resp.status, attempt=attempt)
                        return f"{resp.status}:{cid}"
                    if 400 <= resp.status < 500:         # 5. client error: fail fast to DLQ
                        audit("dlq_client_error", canonical_id=cid, status=resp.status)
                        raise ValueError(f"client_error:{resp.status}")
                    breaker["fails"] += 1                # 6. 5xx: transient, retry with backoff
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            breaker["fails"] += 1
            audit("transient_error", canonical_id=cid, attempt=attempt, error=str(exc))
        if attempt < MAX_RETRIES:                 # 7. jittered backoff, skipped after the final try
            await asyncio.sleep(min(2 ** attempt, 30) + random.uniform(0, 1))
    audit("dlq_exhausted", canonical_id=cid, sha256=digest)               # 8. give up -> DLQ
    raise RuntimeError(f"retries_exhausted:{cid}")

Cap concurrency with an asyncio.Semaphore sized to min((os.cpu_count() or 1) * 2, 32); the fallback matters because os.cpu_count() can return None. In production, stream the request body from the staged copy rather than buffering it whole as the listing does: a multi-gigabyte archival TIFF must never be read into a single bytes object. Validate every normalized payload against a strict JSON Schema with additionalProperties: false before it reaches sync_record, and route schema rejects to quarantine instead of allowing a partial ingestion. Enforce the same least-privilege transport posture your Security Boundary Configuration layer mandates at every hop, and spread real ingestion load through Async Queue Management so a backlog of changed records never blocks the metadata-diff phase.
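
A minimal sketch of the sizing and streaming advice, using only the standard library. The 64 KiB chunk size and the names `worker_limit`, `file_sha256`, and `file_chunks` are assumptions; the point is that neither the digest nor the upload ever holds the whole file in memory.

```python
import asyncio
import hashlib
import os
from collections.abc import AsyncIterator
from pathlib import Path

CHUNK = 64 * 1024  # assumed chunk size; tune to the gateway and disk


def worker_limit() -> int:
    # os.cpu_count() may return None, so fall back to 1 before scaling.
    return min((os.cpu_count() or 1) * 2, 32)


def file_sha256(path: Path) -> str:
    """Hash a staged file incrementally, one chunk at a time."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(CHUNK):
            digest.update(chunk)
    return digest.hexdigest()


async def file_chunks(path: Path) -> AsyncIterator[bytes]:
    """Yield the file in chunks; reads run in a thread so the event loop
    is not blocked by slow disks."""
    with path.open("rb") as fh:
        while chunk := await asyncio.to_thread(fh.read, CHUNK):
            yield chunk
```

aiohttp documents passing an async generator as the `data` argument for streaming uploads, so `session.post(api, data=file_chunks(path), headers=headers)` is the likely wiring, with `file_sha256(path)` supplying the `X-Payload-SHA256` value. A generator can be consumed only once, so each retry attempt needs a fresh `file_chunks(path)` call.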

Expected Output & Verification

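Each record that crosses the boundary ends with exactly one terminal audit line (synced or dlq_*), possibly preceded by transient_error lines for attempts that timed out or lost the connection. A successful first ingestion and an idempotent retry are distinguishable by status alone: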

jsonl

{"event": "synced", "ts": "2026-06-27T14:02:11.418+00:00", "canonical_id": "6f1c...d9", "sha256": "a3f9...", "status": 201, "attempt": 1}
{"event": "synced", "ts": "2026-06-27T14:09:55.002+00:00", "canonical_id": "6f1c...d9", "sha256": "a3f9...", "status": 409, "attempt": 1}

Verify three invariants before trusting a run. First, identity stability: re-running the diff over an unchanged record must produce the same canonical_id and sha256. A change in either on identical input means a mutable or non-deterministic value is leaking into the seed or the hashed bytes. Second, idempotency: a replayed batch must return 409 (or a cached 200) for every previously synced record and create nothing new on the target. Third, transfer accounting: the count of synced events plus dlq_* events must equal the count of records the diff flagged as changed, so no record is silently dropped. A record aborted by an open breaker logs breaker_open rather than a dlq_* event, so count those aborts separately and re-queue them. The diagnostics below map the production symptoms you will actually see to a root cause and a fix:

Symptom	Likely root cause	Diagnostic action	Remediation
`409 Conflict` on a first send	Idempotency key collision or clock skew across workers	Verify `X-Request-ID` derivation and NTP sync	Align worker clocks; widen the server dedup window
`MemoryError` on a large record	Body buffered instead of streamed	Profile with `tracemalloc`; inspect the `aiohttp` request	Stream from the staged file; cap `max_content_length` at the gateway
Schema validation failures	Legacy field drift or null injection	Diff incoming JSON against the current schema version	Set `additionalProperties: false`; add a fallback mapping table
Breaker trips repeatedly	Upstream rate limiting or DB lock contention	Check gateway logs; review legacy query plans	Pace requests; parse and honor `Retry-After`
Same record resyncs every run	A mutable field leaked into `canonical_id`	Re-derive the seed from `doc_id`/`created_dt`/`agency_code` only	Pin the seed to immutable attributes; backfill IDs
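
The transfer-accounting and idempotency invariants above can be checked mechanically against the audit lines. The sketch below assumes the JSONL shape shown in the sample output; the function names are hypothetical, and counting `breaker_open` as a terminal event is an assumption based on the listing raising immediately after it logs that event.

```python
import json
from collections import Counter
from collections.abc import Iterable


def is_terminal(event: str) -> bool:
    # Assumption: breaker_open ends a record's attempt just as dlq_* does.
    return event in ("synced", "breaker_open") or event.startswith("dlq_")


def accounting_problems(audit_lines: Iterable[str], flagged: set[str]) -> list[str]:
    """Compare terminal audit events against the IDs the diff flagged."""
    seen: Counter[str] = Counter()
    for line in audit_lines:
        record = json.loads(line)
        if is_terminal(record.get("event", "")):
            seen[record["canonical_id"]] += 1
    problems = [f"no terminal event: {cid}" for cid in sorted(flagged - set(seen))]
    problems += [f"not flagged by diff: {cid}" for cid in sorted(set(seen) - flagged)]
    problems += [f"{n} terminal events: {cid}"
                 for cid, n in sorted(seen.items()) if n > 1]
    return problems


def replay_is_idempotent(audit_lines: Iterable[str]) -> bool:
    """True when every synced line in a replay is a 409 or a cached 200."""
    statuses = [
        record["status"]
        for record in map(json.loads, audit_lines)
        if record.get("event") == "synced"
    ]
    return all(status in (200, 409) for status in statuses)
```

An empty list from `accounting_problems` means every flagged record reached exactly one terminal state; any entry names the record to investigate.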

Common Pitfalls

Hashing a non-deterministic JSON serialization. If you compute payload_hash over json.dumps(record) with default settings, two dicts holding identical data but built in a different key order serialize differently, and any code path that uses different separators or indentation changes the bytes again. The digest then differs for identical data, so the delta phase re-sends everything and dedup never fires. Serialize with sort_keys=True, separators=(",", ":") (or hash the canonical source bytes) so the same record always yields the same fingerprint.
Retrying 4xx responses as if they were transient. Treating every non-2xx as transient lets a malformed payload or an expired token loop until the breaker trips, starving healthy records of throughput. Branch on the status class: 5xx and connection errors retry with jittered backoff, while 4xx fails fast to quarantine for human review.
Encoding corruption from the legacy export. Old DMS exports frequently arrive as Windows-1252 or UTF-8 with a BOM, and decoding them as plain UTF-8 mangles names and case numbers in ways that survive into the FOIA production. Detect and normalize encoding (strip the BOM, transcode to UTF-8) during the normalization step, and reject — never guess — bytes that fail to decode cleanly.
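
The serialization and encoding pitfalls can be sketched together. The helper names are hypothetical, and `declared` stands for the source encoding recorded in the documented legacy schema; nothing here tries to guess an encoding.

```python
import codecs
import hashlib
import json


def canonical_json_bytes(record: dict) -> bytes:
    # Sorted keys and fixed separators make the bytes independent of
    # insertion order and of whitespace choices made elsewhere.
    text = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def record_fingerprint(record: dict) -> str:
    return hashlib.sha256(canonical_json_bytes(record)).hexdigest()


def decode_legacy(raw: bytes, declared: str = "utf-8") -> str:
    """Decode an export strictly, stripping a UTF-8 BOM if present.

    Raises UnicodeDecodeError instead of substituting characters, so a
    mislabelled export is rejected rather than silently mangled.
    """
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
        declared = "utf-8"              # a UTF-8 BOM settles the question
    return raw.decode(declared, errors="strict")
```

Records that raise in `decode_legacy` belong in quarantine with the raw bytes preserved, since the original export is still the evidence.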

Frequently Asked Questions

How do I detect changes when the legacy system has no reliable change feed?

Use a layered signal. If a monotonic sequence counter or a trustworthy Last-Modified exists, diff against the high-water mark from the last run; where it does not, fall back to a periodic content hash sweep of the candidate set. Either way, the authoritative change decision is the sha256 comparison, and the metadata signal only narrows which records you bother to fetch. A Last-Modified that fires too often therefore costs throughput but never duplicates a record; one that fails to fire is caught only by the periodic sweep, so keep the sweep running wherever the signal is not fully trusted.

Why key idempotency on a UUIDv5 instead of letting the API assign IDs?

A server-assigned ID is only known after the first successful write, so a retry after a timeout you never saw the response to would create a duplicate. A UUIDv5 derived from immutable legacy attributes is computable before the first call and identical on every retry, so the target can reject the duplicate with 409 deterministically. That stable identity is also what lets you replay the dead-letter queue safely once the endpoint recovers.

How do I replay the dead-letter queue without double-counting the disclosure?

Replay in chronological order using the same X-Request-ID/canonical ID each record carried originally. Because the target deduplicates on that key, any record that actually landed before the failure returns 409 and is skipped, while genuinely unsent records are written exactly once. Log every replay decision to the same append-only audit store so a compliance officer can reconstruct precisely which records were delivered and when.
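
A replay loop along those lines might look like the sketch below. The `DeadLetter` shape, the `send` callable (which returns an HTTP status), and the `replay_*` event names are assumptions. It also assumes the target answers 201 for a fresh write and 409 or a cached 200 for a record it already holds, and that every `failed_at` timestamp uses the same ISO-8601 UTC format so that string order matches time order.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadLetter:           # hypothetical dead-letter entry
    failed_at: str          # ISO-8601 UTC timestamp of the original failure
    canonical_id: str       # the same key the first attempt used
    sha256: str


def replay(dlq: Iterable[DeadLetter],
           send: Callable[[DeadLetter], int],
           audit: Callable[[dict], None]) -> dict[str, int]:
    """Replay oldest-first and log one decision per record."""
    summary = {"written": 0, "already_present": 0, "still_failing": 0}
    for item in sorted(dlq, key=lambda d: d.failed_at):
        status = send(item)
        if status == 201:
            outcome = "written"
        elif status in (200, 409):       # landed before the failure
            outcome = "already_present"
        else:
            outcome = "still_failing"    # leave it in the queue
        summary[outcome] += 1
        audit({"event": f"replay_{outcome}", "canonical_id": item.canonical_id,
               "sha256": item.sha256, "status": status})
    return summary
```

Records counted as `still_failing` stay in the queue for the next pass, and the `already_present` count is the number of duplicates the idempotency key prevented.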

Related

Repository Sync Protocols — the parent ingestion boundary this sync plugs into
Document Retrieval & Parsing — the pipeline that consumes the synced records
Extracting Metadata from Scanned Municipal Records with OpenCV
Configuring Role-Based Access for Public Records Portals
Managing High-Volume Intake with Celery Task Queues
