AI Agents SDE Task Viewer
      • Context
      • Plan
      • Prd
  1. Home
  2. AgentSDE
  3. agent-core
  4. gh-441
  5. plan
  6. plan.md
plan.md(3.5 KB)· Apr 12, 2026· 3 min read
  • Summary
  • Files
  • Steps
  • Verification
  • Risks

AW-5: Consume phase-result queue and handle crash recovery#

Summary#

Add crash recovery to the invoke/queue layer so that phase results arriving after an agent-core restart are processed and advance the pipeline, rather than being silently dropped. This builds on AW-4 (#440) which introduces InvocationResultListener and waitForResult().

Files#

FileActionDescription
src/invoke/invocation-result.listener.tsmodifyAdd recoverResult() for unmatched results; inject TaskStateService and InternalAdapterService
src/invoke/invocation-result.listener.spec.tsmodifyAdd tests for crash recovery: duplicate delivery, no-task-found, successful recovery
src/queue/sqlite-job-queue.tsmodifySmart stale job detection in onModuleInit(): check ai-done.json and ai.pid before marking failed
src/queue/sqlite-job-queue.spec.tsmodifyAdd tests for smart stale detection: live PID, existing done file, truly stale
src/invoke/claude-invocation.service.tsmodifyAdd configurable timeout to waitForResult() that rejects and cleans up pending map entry
src/invoke/claude-invocation.service.spec.tsmodifyAdd timeout expiry test for waitForResult()

Steps#

  1. Add recoverResult() to InvocationResultListener — When a phase-result arrives with no matching resolver in the pending map, call recoverResult() instead of just logging. Inject TaskStateService and InternalAdapterService. Look up the task by jobId payload, read current phase/status, and if the task hasn't already advanced past the incoming phase, call InternalAdapterService.handlePhaseResult() to advance the pipeline.
  2. Add duplicate-delivery guard in recoverResult() — Before advancing, check task.currentPhase and the phase status column. If the task has already moved past the incoming phase (status is complete or skipped), log a warning and return early — no re-advance.
  3. Smart stale job detection in SqliteJobQueue.onModuleInit() — Replace the blind "mark all processing as failed" logic. For each stale processing job: (a) parse taskDir from the job payload, (b) check if {taskDir}/meta/ai-done.json exists — if so, treat as a completed result and enqueue recovery, (c) check if {taskDir}/meta/ai.pid exists and process.kill(pid, 0) succeeds — if alive, leave job as processing, (d) otherwise mark as failed.
  4. Add timeout to waitForResult() — Add a setTimeout that rejects the promise after CLAUDE_TIMEOUT_SECS + 100 seconds (~3,700,000 ms). On timeout, delete the entry from the pending map and reject with a descriptive TimeoutError.
  5. Write unit tests — Cover: (a) recoverResult() happy path advancing pipeline, (b) duplicate delivery no-op, (c) no-task-found early return, (d) stale job with live PID stays processing, (e) stale job with done file triggers recovery, (f) truly stale job marked failed, (g) waitForResult() timeout rejection and map cleanup.

Verification#

  • npm run build passes with no type errors
  • npm run test passes — all new and existing tests green
  • npm run lint passes with zero warnings

Risks#

  • Dependency on AW-4 (#440): This issue assumes InvocationResultListener, waitForResult(), and the pending resolver map exist. Must be implemented after AW-4 merges.
  • PID check false positive: process.kill(pid, 0) may succeed for a recycled PID. Mitigation: combine PID check with age heuristic (job lockedAt age > timeout → treat as stale regardless).
ContextPrd