AW-5: Consume phase-result queue and handle crash recovery#
Closes #441
Problem#
When agent-core restarts while AI invocations are in flight, in-memory promise resolvers are lost. Phase results arriving after restart were silently dropped, leaving tasks stuck.
Task / Link#
GitHub Issue #441 — AW-5 (parent: #437 AW-1)
Changes#
src/invoke/invocation-result.listener.ts— New service withwaitForResult()(timeout-backed promise),handleResult()(resolves pending resolver or falls back to crash recovery), andrecoverResult()(duplicate-delivery guard + pipeline advancement viaInternalAdapterService)src/queue/sqlite-job-queue.ts— Replaced blind "mark-all-failed" stale detection with smart logic: checksai-done.json(emit recovery event), liveai.pid(leave as processing), otherwise mark failedsrc/queue/job-queue.interface.ts— AddedtaskDir?: stringtoDispatchEventfor stale job recoverysrc/invoke/invoke.module.ts— RegisteredInvocationResultListenerwith required module importssrc/invoke/invocation-result.listener.spec.ts— New: 13 tests covering timeout, happy path, duplicate delivery, no-task-found, and event handlersrc/queue/sqlite-job-queue.spec.ts— Added 5 tests for smart stale detection scenarios
Notes#
- Depends on AW-4 (#440) for full queue integration;
InvocationResultListeneris a standalone service ready to be wired into BullMQ processor once AW-4 merges - PID liveness check uses
process.kill(pid, 0)combined with a job age heuristic (lockedAtage >CLAUDE_TIMEOUT_SECS + 100s) to guard against recycled PIDs ai-done.jsonformat expected:{ taskId: number, phase: string, signalKind: string }
Testing#
- Unit tests:
npm run test— 27 new tests across 2 spec files, all passing - Build:
npm run build— no TypeScript errors - Lint:
npm run lint— zero warnings