Plan: EADDRINUSE on restart (#15)
Summary
Add a PM2 ecosystem config that sets `restart_delay`, `max_restarts`, and `min_uptime` to prevent EADDRINUSE failures after crash restarts. Update the deploy workflow to use the config file, and add a post-restart port-availability check script that exits non-zero if port 8998 remains unreachable, so the failure can be used to alert on-call.
Files
| File | Action | Description |
|---|---|---|
| `ecosystem.config.cjs` | create | PM2 process config with `restart_delay`, `max_restarts`, `min_uptime` |
| `.github/workflows/deploy.yml` | modify | Use `pm2 startOrRestart ecosystem.config.cjs` instead of `pm2 restart viewerv2-backend`; add post-restart port monitoring step |
| `scripts/check-port.sh` | create | Port-availability check script: polls port 8998 after restart, exits non-zero (alertable) if unreachable after 30s |
Steps
- Create `ecosystem.config.cjs` at repo root with `restart_delay: 3000`, `max_restarts: 5`, `min_uptime: 5000`, script pointing to `dist/main.js`, and env vars loaded from `.env`.
- Create `scripts/check-port.sh` that polls `localhost:8998/health` for up to 30 seconds, printing status, and exits 1 if the port is still unreachable, which makes it suitable for triggering on-call alerts.
- Update the `.github/workflows/deploy.yml` "Restart service" step to use `pm2 startOrRestart ecosystem.config.cjs` so the config is always applied on deploy.
- Add a "Post-restart port monitoring" step in `deploy.yml` that runs `scripts/check-port.sh` to verify port availability after restart.
- Run `npm run lint` and `npm run test` to verify no regressions.
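The polling step could be sketched roughly as follows. This is a minimal illustration and not the project's actual `scripts/check-port.sh`: the function names `probe` and `wait_for_health` are hypothetical, `curl` is assumed to be installed on the VPS, and the `/health` path and 30-second budget are taken from the steps above.

```shell
#!/usr/bin/env bash
# Sketch of scripts/check-port.sh. Function names are illustrative assumptions.

# probe URL
# Succeeds only if the endpoint answers with an HTTP status below 400.
probe() {
  curl --silent --fail --max-time 2 --output /dev/null "$1"
}

# wait_for_health URL TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 as soon as probe succeeds, 1 once the timeout budget is used up.
wait_for_health() {
  local url="$1"
  local timeout="$2"
  local interval="${3:-1}"
  local elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if probe "$url"; then
      echo "OK: $url reachable after ${elapsed}s"
      return 0
    fi
    echo "waiting: $url not reachable yet (${elapsed}s of ${timeout}s)"
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "FAIL: $url unreachable after ${timeout}s" >&2
  return 1
}
```

Ending the script with something like `wait_for_health "http://localhost:8998/health" 30 1 || exit 1` would give the exit codes described above. Note that `elapsed` counts only the sleep intervals, so a slow probe can stretch the wall-clock time somewhat past 30 seconds.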
Verification
- `kill -9 <pid>` of the PM2-managed process results in a clean restart with no EADDRINUSE in `pm2 logs`
- Process accepts requests within 10 seconds of restart
- `scripts/check-port.sh` exits 0 on healthy restart and exits 1 when port is unreachable after 30s
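One way to run the crash-restart check by hand is sketched below. It is an illustration under assumptions: the app name `viewerv2-backend` is taken from the old `pm2 restart` command and may differ in the new config, the helper names are hypothetical, and the log scan looks at recent lines only, so EADDRINUSE entries from before the test would also be counted.

```shell
#!/usr/bin/env bash
# Sketch of a manual crash-restart check. APP_NAME and HEALTH_URL are assumptions.
APP_NAME="${APP_NAME:-viewerv2-backend}"
HEALTH_URL="${HEALTH_URL:-http://localhost:8998/health}"

# Reads log text on stdin and prints how many lines mention EADDRINUSE.
count_eaddrinuse() {
  grep -c 'EADDRINUSE' || true   # grep exits 1 on zero matches; still prints 0
}

# Kills the managed process, waits, then checks health and recent logs.
# Returns 0 only if the service answers and no EADDRINUSE line is found.
verify_crash_restart() {
  local pid hits
  pid="$(pm2 pid "$APP_NAME")"
  kill -9 "$pid"
  sleep 10   # the process should accept requests within 10 seconds
  curl --silent --fail --max-time 2 --output /dev/null "$HEALTH_URL" || return 1
  hits="$(pm2 logs "$APP_NAME" --lines 200 --nostream 2>&1 | count_eaddrinuse)"
  [ "$hits" -eq 0 ]
}
```

Calling `verify_crash_restart` on the VPS and checking its exit status would cover the first two items in the list above; the third is covered by running `scripts/check-port.sh` once with the service up and once with it stopped.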
Risks
- Changing from `pm2 restart` to `pm2 startOrRestart` with an ecosystem file requires the VPS to have the config file present; the deploy workflow already pulls latest master so this is handled automatically.
- The 3-second delay adds ~3s to recovery time, an acceptable tradeoff vs. EADDRINUSE loops.