fix(wrap): detect stale proxy on target port and auto-cleanup before starting#1385
fix(wrap): detect stale proxy on target port and auto-cleanup before starting#1385lennney wants to merge 1 commit into
Conversation
|
Draft review note from a local pass: the stale-port helper shape is narrower than #1356 and the unrelated no-optimize changes are no longer present, which is good. There is still a Windows blocker in the focused coverage.
|
6c47a37 to
2479000
Compare
…starting Adds _ensure_port_free() and helpers to _start_proxy() that: - Detect if target port is already in use (via existing _port_bind_error) - Find the owning process via /proc/net/tcp (Linux) or lsof (macOS) - Only kills processes identified as stale headroom proxies - Reports clear error for non-headroom processes - Uses zero new dependencies - macOS _is_headroom_proxy now uses ps fallback when /proc unavailable - Windows SIGKILL fallback (getattr) for _kill_process Fixes the case where a terminal is killed, leaving an orphaned headroom proxy on the port — the next wrap would wait 30s then fail with a confusing RuntimeError. Now it auto-cleans and restarts. Tests: 14 new tests for all helpers + edge cases (54 total, all pass)
2479000 to
a10ae98
Compare
PR governanceThis PR follows the template and is marked ready for human review. |
JerrettDavis
left a comment
There was a problem hiding this comment.
Thanks for taking on this failure mode. The direction is useful, but I found a few blockers in the current implementation that need tightening before this can safely ship.
Findings:
-
headroom/cli/wrap.py:461identifies a process as a Headroom proxy with a raw substring check:"headroom" in cmdline and "proxy" in cmdline. Because_ensure_port_free()will kill the matching PID, this needs a stricter argv-level check for known Headroom proxy invocations, such asheadroom proxy ...orpython -m headroom.cli proxy .... A non-Headroom process with those words in its command line could be terminated today. -
headroom/cli/wrap.py:523treatspid is Noneas success after the initial bind probe already proved the port was occupied._find_process_on_port()can returnNonebecause/proc/lsofresolution failed or permissions blocked inspection, not only because the port became free. Please re-probe the bind before returning success; if it is still occupied and owner resolution failed, fail clearly instead of falling through to the old cold-start timeout path. -
headroom/cli/wrap.py:376useslsof -ti tcp:{port}without restricting to listening sockets. On macOS/BSD this can return processes with established connections involving that TCP port rather than the listener that owns the local bind. The fallback should filter to LISTEN, e.g.-iTCP:{port} -sTCP:LISTEN, before any kill decision is made. -
headroom/cli/wrap.py:502usessignal.SIGKILLdirectly. On Windows that attribute is absent, and the new focused test fails locally:uv run --with pytest --with pytest-asyncio python -m pytest tests/test_cli/test_wrap_helpers.py::TestEnsurePortFree -q->AttributeError: module 'signal' has no attribute 'SIGKILL'. There is already an older helper in this file usinggetattr(signal, "SIGKILL", signal.SIGTERM); this path needs the same platform-safe handling. -
tests/test_cli/test_wrap_helpers.pydrops the existing_resolve_1m_modeltests from this file.rg _resolve_1m_model testsfinds no replacement coverage. Please restore those tests or move them explicitly; this PR should not reduce unrelated wrap coverage.
I did not approve because the auto-cleanup path can kill local processes, so the owner detection needs to be conservative. PR Governance is green, but there are no full CI checks posted for the latest commit, and the focused test fails on Windows locally as noted above.
|
Replaced by #1406 with Vite-style port fallback (no process killing, pure socket.bind) |
Description
When a terminal running
headroom wrap <agent>is killed without proper cleanup (window close, SSH timeout, crash), the background proxy process becomes orphaned and continues holding the proxy port. The nextheadroom wrapon the same port waits 30-45 seconds (the proxy cold-start window) and then fails with a confusingRuntimeError("Proxy failed to start...").This PR adds stale proxy detection and auto-cleanup at the top of
_start_proxy(). Before spawning uvicorn, it checks whether the target port is already occupied by a headroom proxy (orphaned or otherwise). If so, it kills the old proxy and starts fresh. If the port is held by a non-headroom process, it reports a clear error.Type of Change
Changes Made
headroom/cli/wrap.py: Add_ensure_port_free(),_find_process_on_port(),_linux_find_process_on_port(),_resolve_inode_to_pid(),_is_headroom_proxy(),_kill_process(). New functions detect stale headroom proxy on the target port and clean it up before_start_proxy()spawns uvicorn. Uses only/proc/net/tcp+/proc/*/fd/(Linux) orlsof(macOS fallback). macOS_is_headroom_proxyusespsfallback when/procunavailable. Windows usesgetattr(signal, "SIGKILL", signal.SIGTERM)for safe escalation. Zero new dependencies.tests/test_cli/test_wrap_helpers.py: Add 14 new tests covering all helper functions, /proc/net/tcp parsing, socket symlink matching, tcp6 fallback, stale detection, non-headroom rejection, and kill escalation.Testing
pytest):python -m pytest tests/test_cli/test_wrap_helpers.py::TestEnsurePortFree -v— 14 passedruff check .)mypy headroom)Test Output
Real Behavior Proof
headroom proxy --port 18794, verify bound via socket probe, call_ensure_port_free(18794)from the modified code, verify return value True, verify socket probe returns None (port free), verify old process dead viaos.kill(pid, 0)→ OSError_ensure_port_freedetected the stale proxy via/proc/net/tcp+/proc/PID resolution, sent SIGTERM, waited 3s, verified port free, returned True. Port 18794 was free and the old PID was confirmed dead. For non-headroom processes (e.g. HTTP server),_ensure_port_freereturned False with no kill./proc/net/tcpunavailable,lsofnot guaranteed — gracefully returns None). macOS withlsoffallback (code path exists but no macOS CI). Headroom proxy on macOS (no macOS runner available).Review Readiness
Checklist