The Pipeline Ate Itself at Five Hundred Thousand
Incident status: resolved. Severity: elected, in retrospect, to have been serious.
Closed claim published on HawkinsOps → Live Recovery of a Production Race Condition at 505,000-File Scale
The AutoSOC pipeline had been offline since March 25 due to a credential rotation during an unrelated enterprise hardening sprint that invalidated the Wazuh API credentials the pipeline relied on. The pipeline was restored on April 1 at 05:29 UTC. At 05:39 UTC, the pipeline began processing a six-day backlog of approximately 1.1 million alerts that had accumulated in the Wazuh indexer. Queue depth in the processing layer grew from approximately 2,800 files to 505,836 files over roughly two hours. At 07:44 UTC, the pipeline crashed with an unhandled FileNotFoundError originating from the queue-cap enforcement function in poll-alerts.py at line 164. The function's implementation: sorted([p for p in QUEUE_ROOT.glob("*.json")], key=lambda p: p.stat().st_mtime). At 505,000 files, the glob enumeration took several seconds to complete, during which the function's own overflow logic, combined with Windows filesystem indexing and antivirus scanning, moved files to the processed archive between enumeration and stat. Files that disappeared in that window threw unhandled exceptions. Three consecutive crashes occurred in under ten seconds each, all with the same exception class against different file paths, which ruled out data corruption and confirmed a timing-dependent defect. Root cause: TOCTOU race between glob enumeration and stat application. Remediation: 4-site defensive patch across poll-alerts.py (two sites) and triage.py (two sites), 16 lines changed, wrapping each stat call in exception handling that removed missing files from the sort input rather than propagating the exception. Validation: 28 integration tests covering the patched paths, all passing. Live verification: the patched pipeline ran for 80+ minutes under sustained load with the race condition still actively firing — 1,056 files vanished during sort operations over that window — and caught every instance. Zero unhandled exceptions. Zero data loss. Zero semantic changes to queue ordering behavior under non-race conditions. Total work product: one diagnosed race, one 4-site patch, 28 tests passing, case study committed, pipeline processing resumed, backlog drained.
The race was not a recent regression. It was present from the first version of the queue-cap enforcement function and had been latent for several months at normal queue depths under ten thousand files, where the glob-to-stat window was short enough for the race probability to be negligible. The bug did not appear because the system changed. The bug appeared because the system encountered a state it had never been stressed to — a half-million-file queue produced by an unrelated outage. This is the architecturally important observation: a correctness property that holds at one scale may fail at another scale, and the absence of failure under normal load is not evidence of correctness, it is evidence that the failure condition has not yet been produced. The broader architectural problem the incident exposes is that the queue subsystem lacks an explicit contract. There is no specification anywhere in the codebase that states what the queue-cap function is supposed to guarantee — whether it tolerates concurrent archival, whether it handles filesystem races, whether its behavior at specific queue depths has been characterized. The patch addresses the immediate defect correctly. The patch does not address the missing contract. The architectural recommendation is a formal queue invariant — documented, tested under adversarial load conditions at varying queue depths from ten thousand through one million files, and asserted in continuous integration. Without that invariant, the next scale boundary the system crosses will expose the next latent race, and the remediation cycle will repeat. The invariant the system currently lacks is the one that was used informally during the incident: observed behavior as ground truth over expected behavior. That discipline needs to be codified as a test property at every scale boundary, not retained as a habit that happened to exist this time.
Claims submitted for review: that a TOCTOU race condition was diagnosed and patched; that the patch is verified; that the pipeline achieved zero data loss; that this incident demonstrates staff-level debugging capability. Claim one, a TOCTOU race condition was diagnosed and patched, is approved with evidence. The case study file, the git commit log, and the four-site diff collectively satisfy the evidentiary burden. Claim two, the patch is verified, is approved with the specific scope of verification as stated: 28 integration tests passing, 80 minutes of live production load, 1,056 file disappearances caught during that window, zero unhandled exceptions. The claim is approved as narrow. The broader claim — that the patch is complete — is not part of the submission and is explicitly reminded as not yet evidenced. Claim three, zero data loss, is approved contingent on the queue sort behavior under patched conditions being ordering-equivalent to the unpatched function; build implementation notes confirm this equivalence. Claim four, staff-level debugging capability, is denied as a category of claim. Verification does not certify professional competence levels. Verification certifies that specific technical claims have specific evidence behind them. Synthesis Review is reminded that scaling individual incidents into identity claims is the specific failure mode this office exists to prevent. The incident was a concurrency bug diagnosed in production and patched defensibly. That is the claim on the record.
This one wants a metaphor and Verification will probably reject it. The attempt: a pipeline that eats itself at half a million files is a scale-transition story, the moment a system crosses from behaviors it knew into behaviors it had never had to know. The race was always there. The system just had not been asked to hold five hundred thousand anythings at once, and when it was, the thing that had been safe at ten thousand became the thing that was unsafe at five hundred. The builder did not solve this by being faster than the race. She solved it by refusing to let a failure explain itself until the evidence agreed. Three crashes, three files, same exception — she did not accept the easy answer that the files were bad. She tracked what had changed, and what had changed was not the files but the window between looking at them and acting on them. Manufacturing supervisors call this operating at the edge of your process capability. Detection engineers will call it something else. The important thing is that the instinct was portable. One allowed metaphor used. Second metaphor withdrawn before Verification had to intervene.
Consensus
A concurrency bug was diagnosed from production tracebacks, patched across four sites, validated under live load with the race still firing, and documented. The pipeline processes its backlog. The tests pass.
Point of Contention
Verification wants the queue invariant formalized as a system property. Build wants to ship the next thing. Architecture sides with Verification, which means the invariant will be written, eventually, under institutional pressure.
Open Question
What is the next scale boundary this system has not crossed, and what latent condition waits on the other side of it?