plan.md (2.1 KB) · Apr 2, 2026 · 2 min read

Plan: EADDRINUSE on restart (#15)

Summary

Add a PM2 ecosystem config with restart_delay, max_restarts, and min_uptime to prevent EADDRINUSE failures after crash restarts. Update the deploy workflow to use the config file, and add a post-restart port-availability check script that can alert on-call if port 8998 remains unreachable.

Files

| File | Action | Description |
| --- | --- | --- |
| ecosystem.config.cjs | create | PM2 process config with restart_delay, max_restarts, min_uptime |
| .github/workflows/deploy.yml | modify | Use pm2 startOrRestart ecosystem.config.cjs instead of pm2 restart viewerv2-backend; add post-restart port monitoring step |
| scripts/check-port.sh | create | Port-availability check script: polls port 8998 after restart, exits non-zero (alertable) if unreachable after 30s |

Steps

  1. Create ecosystem.config.cjs at the repo root with restart_delay: 3000 and min_uptime: 5000 (both in milliseconds), max_restarts: 5, script pointing to dist/main.js, and env vars loaded from .env.
  2. Create scripts/check-port.sh that polls localhost:8998/health for up to 30 seconds, printing status, and exits 1 if the port is still unreachable, so a failed check can trigger on-call alerts.
  3. Update .github/workflows/deploy.yml "Restart service" step to use pm2 startOrRestart ecosystem.config.cjs so the config is always applied on deploy.
  4. Add a "Post-restart port monitoring" step in deploy.yml that runs scripts/check-port.sh to verify port availability after restart.
  5. Run npm run lint and npm run test to verify no regressions.

Verification

  • kill -9 <pid> of the PM2-managed process results in a clean restart with no EADDRINUSE in pm2 logs
  • Process accepts requests within 10 seconds of restart
  • scripts/check-port.sh exits 0 on healthy restart and exits 1 when port is unreachable after 30s

Risks

  • Changing from pm2 restart to pm2 startOrRestart with an ecosystem file requires the VPS to have the config file present; the deploy workflow already pulls latest master so this is handled automatically.
  • The restart_delay of 3000 ms adds about 3 seconds to each crash recovery, an acceptable tradeoff against EADDRINUSE restart loops.