fix(wrap): replace stale-proxy detection with Vite-style port fallback#1406
fix(wrap): replace stale-proxy detection with Vite-style port fallback#1406lennney wants to merge 12 commits into
Conversation
PR governanceThis PR follows the template and is marked ready for human review. |
JerrettDavis
left a comment
There was a problem hiding this comment.
The process-killing removal is the right direction, but the fallback port is not propagated to callers. _ensure_proxy() now starts the proxy on �ctual_port and prints the right dashboard URL, but it still returns only the Popen object. All existing callers continue using the originally requested port to set ANTHROPIC_BASE_URL / OPENAI_BASE_URL, inject provider config, register cleanup, and stop the proxy.\n\nThat means if 8787 is busy and _find_available_port() chooses 8788, the proxy starts on 8788 but the wrapped agent still points at 8787, i.e. the busy/stale port this PR was trying to avoid. Please change the contract so the selected port is returned/recorded and every caller uses that actual port, with regression coverage for a fallback case that asserts the launched agent environment/config points at the fallback port.
JerrettDavis
left a comment
There was a problem hiding this comment.
The fallback-port propagation is much better now for the generic launcher and proxy-only wrappers, but the Codex path still resolves the port too early.
codex() calls _find_available_port(port) before _ensure_proxy(). That means a healthy existing Headroom proxy on 8787 is treated as a busy port and Codex rewrites MCP/provider config to 8788, then _launch_tool() starts or uses 8788. The previous behavior would reuse the already-running compatible proxy on 8787. The same early lookup also breaks --no-proxy: if the user asks Codex to attach to an existing proxy on 8787, the bind probe sees 8787 as busy and points Codex at the next free port where no proxy is running.
Please route Codex through the same selected-port contract as the other wrappers: let _ensure_proxy() decide whether to reuse the requested port or fall back, then inject MCP/provider config and build the launch env from the returned actual_port. Add regression coverage for both cases: an existing compatible proxy on the requested port should keep that port, and --no-proxy --port 8787 should not rewrite config/env to 8788 just because 8787 is occupied.
|
@JerrettDavis Fix pushed. Here's what changed: Problem:
Fix:
Regression test added: Commit: |
…starting Adds _ensure_port_free() and helpers to _start_proxy() that: - Detect if target port is already in use (via existing _port_bind_error) - Find the owning process via /proc/net/tcp (Linux) or lsof (macOS) - Only kills processes identified as stale headroom proxies - Reports clear error for non-headroom processes - Uses zero new dependencies - macOS _is_headroom_proxy now uses ps fallback when /proc unavailable - Windows SIGKILL fallback (getattr) for _kill_process Fixes the case where a terminal is killed, leaving an orphaned headroom proxy on the port — the next wrap would wait 30s then fail with a confusing RuntimeError. Now it auto-cleans and restarts. Tests: 14 new tests for all helpers + edge cases (54 total, all pass)
Replace the 180-line process-killing approach (/proc/net/tcp hex parsing, inode->PID resolution, cmdline check, SIGTERM/kill) with a simple 15-line Vite-style socket.bind probe: if port is busy (EADDRINUSE or EACCES), try the next port. Pure stdlib, no /proc, no lsof, no subprocess — works identically on Linux/macOS/Windows. No return type changes. - Add _find_available_port() — socket.bind loop, skips EADDRINUSE + EACCES - _ensure_proxy: replaces _port_bind_error() raise with auto-fallback - _start_proxy: removed _ensure_port_free() call (caller owns port check) - Removed 8 dead functions (_find_process_on_port, _linux_find_..., _resolve_inode_to_pid, _is_headroom_proxy, _read_process_cmdline, _kill_process, _ensure_port_free, _format_unbindable_port_error) - Tests: 6 new _find_available_port tests, full adapter for persistent tests - Lint: ruff clean, 445 tests pass
_ensure_proxy() returns (Popen | None, actual_port) tuple so callers can route traffic to the port the proxy actually bound to, not the originally requested port. Callers updated: - _run_proxy_only_watcher (cursor/cline/continue) - _launch_tool (aider/copilot/codex/goose/openhands/vibe) — with env URL string replacement when port fell back - claude - codex: calls _find_available_port early so Config.toml injection and MCP setup point at the resolved port before _launch_tool runs Tests: 445 pass.
…ind_available_port Removes the early _find_available_port() probe from codex() that treated any busy port as 'needs fallback', even when a healthy proxy was already running on it. Now codex() follows the same selected-port contract as other wrappers (aider, copilot, cursor): _ensure_proxy() decides whether to reuse the requested port or fall back, and all MCP registration, provider config, and env building use the returned actual_port. - Replace early _find_available_port() with _ensure_proxy() call - Add proxy cleanup in finally block (since _launch_tool with no_proxy=True won't know about the proxy we started) - Pass no_proxy=True to _launch_tool to skip duplicate proxy startup - Add regression test verifying _ensure_proxy port flows to _launch_tool Fixes headroom#1406 round 2 review feedback.
270ab5c to
f27f439
Compare
JerrettDavis
left a comment
There was a problem hiding this comment.
Requesting changes for the Codex prepare-only path.
headroom wrap codex --prepare-only now calls _ensure_proxy(...) before the prepare_only return, so a config-only operation starts the proxy and fails if proxy/runtime dependencies are not installed. I reproduced this locally on the PR branch with:
python -m pytest tests/test_cli/test_wrap_helpers.py tests/test_cli/test_wrap_codex.py tests/test_cli/test_wrap_persistent.py -q
Result: helper and persistent tests passed, but 13 Codex prepare/unwrap tests failed because prepare-only tried to start the proxy and hit missing proxy dependencies / module import errors. This is a contract regression: prepare-only should still be able to update Codex config without requiring a live proxy. Please keep the new resolved-port behavior for real launches, but avoid starting the proxy in prepare-only mode; the configured port value is enough for the config update in that path. Add/keep regression coverage that asserts prepare-only does not call _ensure_proxy or _start_proxy.
…oxy startup Prepare-only path now runs MCP/memory/provider config writes using the raw requested port, then returns early. This avoids starting a proxy process for a config-only operation that may lack runtime dependencies. Fixes JerrettDavis review: headroom wrap codex --prepare-only now correctly skips proxy startup while still registering MCP, memory, and provider config.
JerrettDavis
left a comment
There was a problem hiding this comment.
Thanks for fixing the prepare-only path; the original regression is gone. Re-running the broader wrapper target now gets through the prepare-only cases, but there is still one Codex lifecycle regression:
python -m pytest tests/test_cli/test_wrap_helpers.py tests/test_cli/test_wrap_codex.py tests/test_cli/test_wrap_persistent.py -q
Result: 135 passed, 1 failed.
Failure: tests/test_cli/test_wrap_codex.py::test_wrap_codex_memory_launch_failure_unwrap_cleans_memory_only_config now raises FileNotFoundError reading .codex/config.toml. From the diff, codex() now calls _ensure_proxy(...) before the MCP/provider/memory config injection on the normal launch path. If proxy startup fails, we exit before writing the managed config that the unwrap/cleanup path expects to remove. The prepare-only fix should stay, but please preserve the launch-failure cleanup contract: either keep the non-runtime config writes before the proxy startup as before, or update the lifecycle so a failed proxy start still leaves unwrap in the expected state and the existing test passes.
## Description The Per-Project Savings empty state currently shows a hardcoded `ANTHROPIC_BASE_URL: http://127.0.0.1:8787/p/<project-name>`. When the proxy listens on a fallback or custom port, users can copy a broken setup URL from the dashboard. This change derives the hint from the browser's live origin and keeps the existing `/p/<project-name>` suffix used by per-project savings. Closes #1508. Related context: #1406 made non-default proxy ports a normal path, which makes the hardcoded dashboard hint user-visible more often. ## Type of Change - [x] Bug fix (non-breaking change that fixes an issue) - [ ] New feature (non-breaking change that adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) - [ ] Documentation update - [ ] Performance improvement - [ ] Code refactoring (no functional changes) ## Changes Made - Replace the static Per-Project Savings setup hint with an Alpine `x-text` binding that uses `window.location.origin`. - Preserve the `/p/<project-name>` suffix so the displayed path shape stays aligned with the existing per-project routing contract. - Add a Playwright regression that loads the dashboard from non-default origins and asserts the empty state follows the active page origin instead of `8787`. ## Testing - [x] Unit tests pass (`uv run pytest tests/test_dashboard_cache_ttl_playwright.py -k "per_project_setup_url_uses_current_origin" -v`) - [x] Unit tests pass (`uv run pytest tests/test_owned_asset_encoding.py::test_get_dashboard_html_reads_as_utf8 tests/test_proxy_dashboard_stats_cache.py::test_dashboard_uses_cached_stats_and_lazy_history_feed_polling -v`) - [x] Linting passes (`uv run ruff check tests/test_dashboard_cache_ttl_playwright.py`) - [ ] Type checking passes (`uv run mypy headroom`) - [x] New tests added for new functionality when applicable - [x] Manual testing performed ### Test Output ```text uv run pytest tests/test_dashboard_cache_ttl_playwright.py -k "per_project_setup_url_uses_current_origin" -v tests/test_dashboard_cache_ttl_playwright.py::test_dashboard_per_project_setup_url_uses_current_origin PASSED [100%] ================= 1 passed, 1 deselected, 1 warning in 0.82s ================== uv run pytest tests/test_owned_asset_encoding.py::test_get_dashboard_html_reads_as_utf8 tests/test_proxy_dashboard_stats_cache.py::test_dashboard_uses_cached_stats_and_lazy_history_feed_polling -v tests/test_owned_asset_encoding.py::test_get_dashboard_html_reads_as_utf8 PASSED [ 50%] tests/test_proxy_dashboard_stats_cache.py::test_dashboard_uses_cached_stats_and_lazy_history_feed_polling PASSED [100%] ======================== 2 passed, 1 warning in 0.16s ========================= uv run ruff check tests/test_dashboard_cache_ttl_playwright.py All checks passed! ``` ## Real Behavior Proof - Environment: Playwright Chromium dashboard harness, dashboard template served through the existing route interception used by the dashboard tests, no live provider required. - Exact command / steps: `uv run pytest tests/test_dashboard_cache_ttl_playwright.py -k "per_project_setup_url_uses_current_origin" -v` - Observed result: `http://127.0.0.1:8788/dashboard` passed with the new current-origin assertion, and `origin/main` failed the same assertion because the page still rendered `ANTHROPIC_BASE_URL: http://127.0.0.1:8787/p/<project-name>`. A separate browser check against `http://headroom.local:9393/dashboard` also passed on the patched branch. - Not tested: full live `headroom proxy --port 8788` browser validation, unless it is run during implementation. ## Review Readiness - [x] I have performed a self-review - [x] This PR is ready for human review ## Checklist - [x] My code follows the project's style guidelines - [x] I have performed a self-review of my code - [ ] I have commented my code, particularly in hard-to-understand areas - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [x] I have added tests that prove my fix is effective or that my feature works - [x] New and existing unit tests pass locally with my changes - [x] I have updated the CHANGELOG.md if applicable ## Additional Notes `CHANGELOG.md` is intentionally unchanged because this repo generates changelog entries from conventional commits. The documentation checkbox is satisfied by correcting the in-dashboard setup instruction.
…up contract - Resolve 4 merge conflicts in _ensure_proxy and claude() from main's persistent-proxy feature checks and vertex_api_url support. - Move codex config writes (MCP, coding compressor, memory) before _ensure_proxy() so unwrap has config to clean up when proxy startup or binary lookup fails. - Move codex binary check before _ensure_proxy to match original flow. - Update _ensure_proxy callers in test_wrap_persistent and test_wrap_claude_vertex_proxy_env to unpack (proc, actual_port) tuple. - Fix fake_ensure_proxy mock in claude vertex tests to return tuple. Fixes headroomlabs-ai#1406 round 4 review feedback from JerrettDavis.
Previously inject_opencode_provider_config(port) and _build_opencode_launch_env(port) used the raw port before _launch_tool could fall back to a different port. When the requested port was busy, the provider config block and MCP URL pointed at the old port while the proxy was on the fallback port. Now opencode() calls _ensure_proxy directly before config injection (same selected-port contract as codex), then passes no_proxy=True to _launch_tool to skip duplicate proxy startup. Also re-injects MCP config when actual_port != port.
There was a problem hiding this comment.
Pull request overview
This PR refactors headroom wrap proxy startup to avoid stale/orphaned proxy port conflicts by adopting a Vite-style “probe + increment” strategy: if the requested port can’t be bound, it selects the next available port and propagates the chosen port through wrapper setup and tests.
Changes:
- Add
_find_available_port()and update_ensure_proxy()to auto-fallback to a bindable port and return(proc, actual_port)to callers. - Propagate the resolved port through wrapper flows (banner/setup output, env injection, MCP setup) and update Codex/OpenCode flows to use the resolved port.
- Rewrite and adjust CLI tests to match the new port-selection contract.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| headroom/cli/wrap.py | Implements port fallback and updates wrapper flows to use/propagate actual_port. |
| tests/test_cli/test_wrap_persistent.py | Updates persistent proxy tests for new _ensure_proxy return contract and new error path. |
| tests/test_cli/test_wrap_helpers.py | Adds unit tests for _find_available_port and updates mocks for tuple return. |
| tests/test_cli/test_wrap_codex.py | Updates mocks and adds regression coverage for Codex port propagation from _ensure_proxy. |
| tests/test_cli/test_wrap_claude_vertex_proxy_env.py | Updates mocks and assertions for tuple return contract. |
| tests/test_cli/test_main_help_version.py | Updates patched _ensure_proxy return value to tuple. |
| docs/plans/2026-06-25-port-fallback.md | Adds design/implementation plan documentation for the port fallback approach. |
| .hermes/pr-body.md | Adds PR body content mirroring the PR description/testing proof. |
Comments suppressed due to low confidence (1)
headroom/cli/wrap.py:3231
- _launch_tool() now registers the proxy-client marker only after _ensure_proxy completes. If another wrapper exits during this window, it can observe “no other clients” and terminate the shared proxy while this process is starting up.
proxy_holder[0], actual_port = _ensure_proxy(
port,
no_proxy,
learn=learn,
memory=memory,
| for port in range(start_port, start_port + max_attempts): | ||
| error = _port_bind_error(port) | ||
| if error is None: | ||
| return port | ||
| if error.errno not in (errno.EADDRINUSE, errno.EACCES): | ||
| raise error | ||
| raise RuntimeError( | ||
| f"No available port found in range " | ||
| f"{start_port}-{start_port + max_attempts - 1}" | ||
| ) |
| proxy_holder[0], actual_port = _ensure_proxy( | ||
| port, no_proxy, learn=learn, memory=memory, agent_type=agent_type | ||
| ) | ||
| _push_runtime_env(port, no_proxy) | ||
| port_holder[0] = actual_port | ||
| _register_proxy_client(actual_port) |
| @@ -3694,7 +3725,9 @@ | |||
| vertex_api_url=vertex_upstream, | |||
| clear_vertex_api_url=use_vertex and vertex_upstream is None, | |||
| ) | |||
| _push_runtime_env(port, no_proxy) | |||
| port_holder[0] = actual_port | |||
| _register_proxy_client(actual_port) | |||
| _push_runtime_env(actual_port, no_proxy) | |||
| _codex_proxy, actual_port = _ensure_proxy( | ||
| port, | ||
| no_proxy, | ||
| learn=learn, | ||
| memory=memory, | ||
| agent_type="codex", | ||
| code_graph=code_graph, | ||
| backend=backend, | ||
| anyllm_provider=anyllm_provider, | ||
| region=region, | ||
| ) |
| _opencode_proxy, actual_port = _ensure_proxy( | ||
| port, | ||
| no_proxy, | ||
| learn=learn, | ||
| memory=memory, | ||
| agent_type="opencode", | ||
| code_graph=code_graph, | ||
| backend=backend, | ||
| anyllm_provider=anyllm_provider, | ||
| region=region, | ||
| ) |
|
|
||
|
|
||
| def _make_cleanup(proxy_proc_holder: list, port: int = 8787) -> Any: | ||
| def _make_cleanup(proxy_proc_holder: list, port: Any = 8787) -> Any: |
| monkeypatch.setattr( | ||
| wrap_cli, | ||
| "_port_bind_error", | ||
| lambda port: PermissionError(10013, "access denied by OS port reservation"), | ||
| "_find_available_port", | ||
| lambda port, **kw: (_ for _ in ()).throw( | ||
| OSError(errno.EACCES, "access denied by OS port reservation") | ||
| ), |
JerrettDavis
left a comment
There was a problem hiding this comment.
Reviewed the latest port-fallback updates. The prior Codex launch-failure cleanup regression is addressed: prepare-only remains config-only, and the normal Codex launch path now performs the non-runtime config writes before proxy startup or binary launch failure can abort. The selected actual_port from _ensure_proxy() is still used for runtime provider/env wiring, and the OpenCode follow-up now resolves the selected port before provider/env injection and refreshes MCP when the port changes.
Local focused verification passed: python -m pytest tests/test_cli/test_wrap_helpers.py tests/test_cli/test_wrap_codex.py tests/test_cli/test_wrap_persistent.py -q -> 144 passed.
I do not see a remaining code-review blocker. Only governance checks are attached to the latest head, so this approval is based on code review plus the targeted wrapper contract verification rather than a full CI matrix.
- _find_available_port: clamp range at 65536, catch OverflowError - Register proxy-client marker BEFORE _ensure_proxy in all wrappers (codex, opencode, _launch_tool, _run_proxy_only_watcher, claude) - Move marker on port fallback (unregister old, register new) - _make_cleanup: port: Any -> port: int | list[int] - test_wrap_persistent: EACCES -> EADDRNOTAVAIL (skipped by impl) Addresses headroomlabs-ai#1406 round 5 review feedback (Copilot + JerrettDavis)
Description
Replace the 180-line process-killing approach with a 15-line Vite-style socket.bind probe: if port is busy (EADDRINUSE or EACCES), try the next available port.
Pure stdlib -- no /proc, no lsof, no subprocess -- works identically on Linux/macOS/Windows.
Problem
When
headroom wrap <agent>is killed without proper cleanup (window close, SSH timeout, crash), the background proxy becomes orphaned and holds the port. The nextheadroom wrapon the same port would wait 30-45 seconds then fail with a confusing error.This PR takes a simpler, safer approach: find the next available port. No process detection, no killing.
Related issues
_live_proxy_clientsmarker files; this PR doesn't touch that codeType of Change
Changes Made
headroom/cli/wrap.py: Add_find_available_port()-- socket.bind loop that skips EADDRINUSE (busy) and EACCES (reserved/privileged) ports, returns first available port in range. Replace_ensure_proxyport-bind check with auto-fallback call to_find_available_port. Remove_ensure_port_free()call from_start_proxy(). Remove 8 dead functions:_find_process_on_port,_linux_find_process_on_port,_resolve_inode_to_pid,_is_headroom_proxy,_read_process_cmdline,_kill_process,_ensure_port_free,_format_unbindable_port_error.tests/test_cli/test_wrap_helpers.py: Remove 14 oldTestEnsurePortFreetests (mocked /proc parsing, process killing). Add 6 newTestFindAvailablePorttests covering: port free, first port busy, multiple busy, EACCES skipped, unexpected error propagated, range exhausted.tests/test_cli/test_wrap_persistent.py: Adapt persistence tests for_find_available_portmock. Rewrite unbindable-port test to use new error path.Zero new dependencies. Zero changes to core proxy server, MCP, compression, or providers.
Testing
python -m pytest tests/test_cli/ -v)Test Output
Real Behavior Proof
python -m pytest tests/test_cli/test_wrap_helpers.py::TestFindAvailablePort -v-- 6/6 pass for port fallback. Ran full test suitepython -m pytest tests/test_cli/ -q-- 445/445 pass._find_available_port(8787)returns 8787 when free, 8788 when 8787 is busy. EACCES skipped same as EADDRINUSE. Non-retryable errors (EADDRNOTAVAIL) propagate immediately.Review Readiness
Checklist