fix(wrap): detect stale proxy on target port and auto-cleanup before starting#1385

Closed
lennney wants to merge 1 commit into
headroomlabs-ai:mainfrom
lennney:fix/wrap-port-conflict-cleanup

Conversation

@lennney lennney commented Jun 24, 2026

Description

When a terminal running headroom wrap <agent> is killed without proper cleanup (window close, SSH timeout, crash), the background proxy process becomes orphaned and continues holding the proxy port. The next headroom wrap on the same port waits 30-45 seconds (the proxy cold-start window) and then fails with a confusing RuntimeError("Proxy failed to start...").

This PR adds stale proxy detection and auto-cleanup at the top of _start_proxy(). Before spawning uvicorn, it checks whether the target port is already occupied by a headroom proxy (orphaned or otherwise). If so, it kills the old proxy and starts fresh. If the port is held by a non-headroom process, it reports a clear error.
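The "is the port already occupied" check described above can be approximated with a plain bind probe. This is a minimal sketch, not the PR's code: `port_in_use` is a hypothetical name (the PR itself reuses an existing `_port_bind_error` helper whose signature is not shown here), and the loopback host is an assumption.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding host:port fails, i.e. something already holds it.

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket is bound to this address.
            return True
    return False
```

A probe like this only says that the port is taken, not who holds it, which is why the PR pairs it with an owner lookup before deciding whether to kill anything.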

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • headroom/cli/wrap.py: Add _ensure_port_free(), _find_process_on_port(), _linux_find_process_on_port(), _resolve_inode_to_pid(), _is_headroom_proxy(), and _kill_process(). These helpers detect a stale headroom proxy on the target port and clean it up before _start_proxy() spawns uvicorn. Owner lookup uses only /proc/net/tcp and /proc/*/fd/ on Linux, with lsof as the macOS fallback; on macOS, _is_headroom_proxy() falls back to ps when /proc is unavailable. _kill_process() escalates with getattr(signal, "SIGKILL", signal.SIGTERM) so it stays safe on Windows, where signal.SIGKILL is absent. Zero new dependencies.
  • tests/test_cli/test_wrap_helpers.py: Add 14 new tests covering all helper functions, /proc/net/tcp parsing, socket symlink matching, tcp6 fallback, stale detection, non-headroom rejection, and kill escalation.
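For readers unfamiliar with the /proc approach named above, here is a rough sketch of the two steps: find the socket inode of the listener in /proc/net/tcp, then find the PID whose fd table contains a matching `socket:[inode]` symlink. The function names `listening_inodes` and `pid_for_inode` are hypothetical and are not the PR's helpers; the `proc_root` parameter is an assumption added so the lookup can be exercised against a fake directory tree.

```python
import os
from typing import List, Optional

TCP_LISTEN = "0A"  # hex state code for LISTEN in /proc/net/tcp


def listening_inodes(proc_net_tcp: str, port: int) -> List[str]:
    """Return socket inodes of LISTEN entries whose local port equals `port`."""
    inodes = []
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 10:
            continue
        # fields[1] is local "ADDR:PORT" in hex, fields[3] the state,
        # fields[9] the socket inode.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == TCP_LISTEN:
            inodes.append(fields[9])
    return inodes


def pid_for_inode(inode: str, proc_root: str = "/proc") -> Optional[int]:
    """Scan <proc_root>/<pid>/fd for a symlink pointing at socket:[inode]."""
    target = f"socket:[{inode}]"
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    return int(entry)
            except OSError:
                continue
    return None
```

The same parsing applies to /proc/net/tcp6, since the port is still the hex value after the last colon. Note that `pid_for_inode` returning `None` can mean "permission denied" as well as "no such socket", so a caller should not read it as proof that the port is free.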

Testing

  • Unit tests pass (pytest): python -m pytest tests/test_cli/test_wrap_helpers.py::TestEnsurePortFree -v — 14 passed
  • Linting passes (ruff check .)
  • Type checking passes (mypy headroom)
  • New tests added for new functionality
  • Manual testing performed

Test Output

> python -m pytest tests/test_cli/test_wrap_helpers.py::TestEnsurePortFree -v --no-header
============================= 14 passed in 0.34s ==============================
test_ensure_port_free_port_is_free PASSED
test_ensure_port_free_stale_headroom PASSED
test_ensure_port_free_non_headroom PASSED
test_linux_find_process_on_port_empty PASSED
test_linux_find_process_on_port_found PASSED
test_is_headroom_proxy_true PASSED
test_is_headroom_proxy_false PASSED
test_kill_process_terminates PASSED
test_kill_process_force_kill PASSED
test_resolve_inode_to_pid_matches_symlink PASSED
test_linux_find_process_on_port_tcp6 PASSED

Real Behavior Proof

  • Environment: Ubuntu 24.04 x86_64, Python 3.12.3, headroom 0.26.0 editable install in Hermes venv
  • Exact command / steps: Start headroom proxy on port 18794 via headroom proxy --port 18794, verify bound via socket probe, call _ensure_port_free(18794) from the modified code, verify return value True, verify socket probe returns None (port free), verify old process dead via os.kill(pid, 0) → OSError
  • Observed result: _ensure_port_free detected the stale proxy via /proc/net/tcp plus /proc/*/fd/ PID resolution, sent SIGTERM, waited 3s, verified the port was free, and returned True. Port 18794 was free and the old PID was confirmed dead. For non-headroom processes (e.g. an HTTP server), _ensure_port_free returned False without killing anything.
  • Not tested: Windows (/proc/net/tcp unavailable, lsof not guaranteed — gracefully returns None). macOS with lsof fallback (code path exists but no macOS CI). Headroom proxy on macOS (no macOS runner available).

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md if applicable

@JerrettDavis

Collaborator

Draft review note from a local pass: the stale-port helper shape is narrower than #1356 and the unrelated no-optimize changes are no longer present, which is good. There is still a Windows blocker in the focused coverage.

uv run --with pytest python -m pytest tests/test_cli/test_wrap_helpers.py::TestEnsurePortFree -q fails 1/14 on Windows:

TestEnsurePortFree.test_kill_process_force_kill raises AttributeError: module 'signal' has no attribute 'SIGKILL' from headroom/cli/wrap.py when _kill_process() falls back from SIGTERM to SIGKILL.

uv run --with ruff ruff check headroom/cli/wrap.py tests/test_cli/test_wrap_helpers.py passes. Before marking this ready, please make the escalation path platform-aware, e.g. use SIGKILL only when available and use an appropriate Windows termination fallback/test expectation.
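A platform-aware escalation path might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's `_kill_process()`: `stop_process`, `send`, and `is_alive` are hypothetical names, and the signalling and liveness checks are injected as callables so the logic does not depend on any one platform's process API.

```python
import signal
import time
from typing import Callable

# SIGKILL does not exist in the signal module on Windows, so fall back to SIGTERM.
FORCE_SIGNAL = getattr(signal, "SIGKILL", signal.SIGTERM)


def _wait_gone(pid: int, is_alive: Callable[[int], bool],
               timeout: float, poll: float) -> bool:
    """Poll until the process is gone or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return True
        time.sleep(poll)
    return not is_alive(pid)


def stop_process(pid: int,
                 send: Callable[[int, int], None],
                 is_alive: Callable[[int], bool],
                 grace: float = 3.0,
                 poll: float = 0.05) -> bool:
    """Send SIGTERM, wait `grace` seconds, then escalate to FORCE_SIGNAL."""
    send(pid, signal.SIGTERM)
    if _wait_gone(pid, is_alive, grace, poll):
        return True
    send(pid, FORCE_SIGNAL)
    return _wait_gone(pid, is_alive, grace, poll)
```

On POSIX, `send` could simply be `os.kill`. On Windows, `os.kill` with any signal other than `CTRL_C_EVENT` or `CTRL_BREAK_EVENT` terminates the process through `TerminateProcess`, so a liveness check based on `os.kill(pid, 0)` is not safe there and `is_alive` would need a different implementation.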

@lennney lennney force-pushed the fix/wrap-port-conflict-cleanup branch from 6c47a37 to 2479000 Compare June 24, 2026 17:23
…starting

Adds _ensure_port_free() and helpers to _start_proxy() that:
- Detect whether the target port is already in use (via existing _port_bind_error)
- Find the owning process via /proc/net/tcp (Linux) or lsof (macOS)
- Kill only processes identified as stale headroom proxies
- Report a clear error for non-headroom processes
- Use zero new dependencies
- Fall back to ps in _is_headroom_proxy on macOS when /proc is unavailable
- Fall back from SIGKILL via getattr in _kill_process on Windows

Fixes the case where a terminal is killed, leaving an orphaned headroom
proxy on the port — the next wrap would wait 30s then fail with a
confusing RuntimeError. Now it auto-cleans and restarts.

Tests: 14 new tests for all helpers + edge cases (54 total, all pass)
@lennney lennney force-pushed the fix/wrap-port-conflict-cleanup branch from 2479000 to a10ae98 Compare June 24, 2026 17:28
@github-actions

github-actions Bot commented Jun 24, 2026

Contributor

PR governance

This PR follows the template and is marked ready for human review.

@lennney lennney marked this pull request as ready for review June 25, 2026 01:35
@github-actions github-actions Bot added the status: ready for review label Jun 25, 2026

@JerrettDavis JerrettDavis left a comment

Collaborator

Thanks for taking on this failure mode. The direction is useful, but I found a few blockers in the current implementation that need tightening before this can safely ship.

Findings:

  • headroom/cli/wrap.py:461 identifies a process as a Headroom proxy with a raw substring check: "headroom" in cmdline and "proxy" in cmdline. Because _ensure_port_free() will kill the matching PID, this needs a stricter argv-level check for known Headroom proxy invocations, such as headroom proxy ... or python -m headroom.cli proxy .... A non-Headroom process with those words in its command line could be terminated today.

  • headroom/cli/wrap.py:523 treats pid is None as success after the initial bind probe already proved the port was occupied. _find_process_on_port() can return None because /proc/lsof resolution failed or permissions blocked inspection, not only because the port became free. Please re-probe the bind before returning success; if it is still occupied and owner resolution failed, fail clearly instead of falling through to the old cold-start timeout path.

  • headroom/cli/wrap.py:376 uses lsof -ti tcp:{port} without restricting to listening sockets. On macOS/BSD this can return processes with established connections involving that TCP port rather than the listener that owns the local bind. The fallback should filter to LISTEN, e.g. -iTCP:{port} -sTCP:LISTEN, before any kill decision is made.

  • headroom/cli/wrap.py:502 uses signal.SIGKILL directly. On Windows that attribute is absent, and the new focused test fails locally: uv run --with pytest --with pytest-asyncio python -m pytest tests/test_cli/test_wrap_helpers.py::TestEnsurePortFree -q -> AttributeError: module 'signal' has no attribute 'SIGKILL'. There is already an older helper in this file using getattr(signal, "SIGKILL", signal.SIGTERM); this path needs the same platform-safe handling.

  • tests/test_cli/test_wrap_helpers.py drops the existing _resolve_1m_model tests from this file. rg _resolve_1m_model tests finds no replacement coverage. Please restore those tests or move them explicitly; this PR should not reduce unrelated wrap coverage.
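To illustrate the first finding, an argv-level check could be sketched as follows. The accepted invocation shapes are taken from the two examples in the review (`headroom proxy ...` and `python -m headroom.cli proxy ...`); the third shape, a console script run through the interpreter, is an assumption about how an installed entry point may appear in a process listing. `looks_like_headroom_proxy` and `split_cmdline` are hypothetical names.

```python
import os
from typing import List


def split_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into argv."""
    return [part.decode("utf-8", "replace") for part in raw.split(b"\0") if part]


def looks_like_headroom_proxy(argv: List[str]) -> bool:
    """Match known invocation shapes token by token, not by substring."""
    if not argv:
        return False
    exe = os.path.basename(argv[0])
    rest = argv[1:]
    # Shape 1: headroom proxy ...
    if exe == "headroom":
        return rest[:1] == ["proxy"]
    if exe.startswith("python"):
        # Shape 2: python -m headroom.cli proxy ...
        if rest[:2] == ["-m", "headroom.cli"]:
            return rest[2:3] == ["proxy"]
        # Shape 3 (assumed): python /path/to/bin/headroom proxy ...
        if rest and os.path.basename(rest[0]) == "headroom":
            return rest[1:2] == ["proxy"]
    return False
```

With this shape, a command such as `python3 -m http.server --directory /srv/headroom-proxy` is rejected even though both words appear in its command line, which is the false positive the substring check allows.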

I did not approve because the auto-cleanup path can kill local processes, so the owner detection needs to be conservative. PR Governance is green, but there are no full CI checks posted for the latest commit, and the focused test fails on Windows locally as noted above.
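The conservative handling of an unresolved owner (the `pid is None` finding) could be sketched like this: when the owner lookup returns nothing, re-probe the bind, and fail loudly if the port is still held. `port_in_use` and `port_free_when_owner_unknown` are hypothetical names, and the error wording is illustrative only.

```python
import socket
from typing import Callable


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Bind probe: True if host:port cannot be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False


def port_free_when_owner_unknown(
    port: int, probe: Callable[[int], bool] = port_in_use
) -> bool:
    """Call only after the owner lookup returned None.

    Returns True if the port turned out to be free after all; raises if it is
    still occupied, rather than falling through to the cold-start timeout.
    """
    if probe(port):
        raise RuntimeError(
            f"Port {port} is in use, but the owning process could not be "
            "identified (lookup failed or permission denied)."
        )
    return True
```

The point of the second probe is to separate "the port was released between checks" from "the lookup could not see the owner"; only the first is a success.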

@JerrettDavis JerrettDavis added status: code changes requested and removed status: ready for review labels Jun 25, 2026
@lennney

lennney commented Jun 25, 2026

Author

Replaced by #1406 with Vite-style port fallback (no process killing, pure socket.bind)
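For context, a port fallback based purely on socket.bind generally looks something like the sketch below. This is not the code from #1406, whose contents are not shown here; `find_free_port`, the loopback host, and the `max_tries` limit are all assumptions made for illustration.

```python
import socket


def find_free_port(start: int, host: str = "127.0.0.1", max_tries: int = 20) -> int:
    """Return the first bindable port in [start, start + max_tries)."""
    stop = min(start + max_tries, 65536)  # never probe past the valid port range
    for port in range(start, stop):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # occupied, try the next port
            return port
    raise RuntimeError(f"No free port found in range {start}-{stop - 1}")
```

Because the probe socket is closed before the real server binds, another process can still take the port in between, so the caller should be prepared for the actual bind to fail.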

@lennney lennney closed this Jun 25, 2026
@lennney lennney deleted the fix/wrap-port-conflict-cleanup branch July 3, 2026 10:53