fix(proxy): surface codex websocket loop failures in livez#1727

Merged
chopratejas merged 2 commits into
headroomlabs-ai:mainfrom
rodboev:pr/1720-responses-ws-livez-wedge
Jul 3, 2026

Conversation

@rodboev rodboev commented Jul 3, 2026

Contributor

Description

Codex /v1/responses WebSocket disconnects can trigger a known websockets callback failure before connection_made() initializes recv_messages. When that happens, the proxy process can stay alive while /livez keeps advertising a healthy state. This change contains that known callback failure in the proxy runtime, records loop callback health, and makes /livez report the degraded state instead of always returning a clean process-alive payload.

Closes #1720

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • Add a proxy-owned asyncio loop exception handler that recognizes the known websockets connection_lost ClientConnection.recv_messages AttributeError, records it in bounded runtime health state, and leaves unrelated loop exceptions delegated to the previous or default handler.
  • Extend /livez so the route remains cheap and unauthenticated while reflecting recorded event-loop callback health instead of always reporting a clean process-alive payload.
  • Preserve existing Codex WebSocket relay, fallback, session deregistration, and termination-cause behavior for normal handler-owned failures.
  • Add focused regression coverage for the known callback failure, the negative-space delegation path, and the health route response after loop callback degradation.
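
As a rough illustration of the first bullet, the handler pattern could look like the sketch below. This is not the code from headroom/proxy/server.py: the names `install_handler`, `is_known_ws_failure`, and `loop_callback_failures`, the marker-string matching rule, and the bound of 16 entries are all assumptions made for the example.

```python
import asyncio
from collections import deque

# Hypothetical names and matching rule; the real implementation lives in
# headroom/proxy/server.py and may identify the failure differently.
KNOWN_MARKERS = ("ClientConnection", "recv_messages")
loop_callback_failures = deque(maxlen=16)  # bounded runtime health state


def is_known_ws_failure(context):
    """True when the loop context carries the known websockets AttributeError."""
    exc = context.get("exception")
    return isinstance(exc, AttributeError) and all(
        marker in str(exc) for marker in KNOWN_MARKERS
    )


def install_handler(loop):
    """Install a handler that records the known failure and delegates the rest."""
    previous = loop.get_exception_handler()  # None when no custom handler is set

    def handler(loop, context):
        if is_known_ws_failure(context):
            # Record the failure instead of passing it to the default handler.
            loop_callback_failures.append(context.get("message", ""))
            return
        if previous is not None:
            previous(loop, context)
        else:
            loop.default_exception_handler(context)

    loop.set_exception_handler(handler)
    return handler
```

Capturing the previous handler at install time is what keeps unrelated loop exceptions on their original path, which is the negative-space delegation case the regression tests cover.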

Testing

  • Unit tests pass (uv run pytest tests/test_proxy_healthchecks.py tests/test_openai_codex_ws_lifecycle.py tests/test_proxy_loop_exception_health.py -q)
  • Linting passes (uv run ruff check headroom/proxy/server.py tests/test_proxy_healthchecks.py tests/test_openai_codex_ws_lifecycle.py tests/test_proxy_loop_exception_health.py and uv run ruff format --check headroom/proxy/server.py tests/test_proxy_healthchecks.py tests/test_openai_codex_ws_lifecycle.py tests/test_proxy_loop_exception_health.py)
  • Type checking passes (uv run mypy headroom)
  • New tests added for new functionality when applicable
  • Manual testing performed

Test Output

uv run pytest tests/test_proxy_healthchecks.py tests/test_openai_codex_ws_lifecycle.py tests/test_proxy_loop_exception_health.py -q

============================= test session starts =============================
platform win32 -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0
rootdir: D:\Repos\headroom-pr-1720-responses-ws-livez-wedge
configfile: pyproject.toml
plugins: anyio-4.12.1, langsmith-0.9.3, asyncio-1.3.0, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 33 items

tests\test_proxy_healthchecks.py ............                            [ 36%]
tests\test_openai_codex_ws_lifecycle.py ...................              [ 93%]
tests\test_proxy_loop_exception_health.py ..                             [100%]

============================== warnings summary ===============================
.venv\Lib\site-packages\fastapi\testclient.py:1
  D:\Repos\headroom-pr-1720-responses-ws-livez-wedge\.venv\Lib\site-packages\fastapi\testclient.py:1: StarletteDeprecationWarning: Using `httpx` with `starlette.testclient` is deprecated; install `httpx2` instead.
    from starlette.testclient import TestClient as TestClient  # noqa

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 33 passed, 1 warning in 9.78s ========================

uv run ruff check headroom/proxy/server.py tests/test_proxy_healthchecks.py tests/test_openai_codex_ws_lifecycle.py tests/test_proxy_loop_exception_health.py

All checks passed!

uv run ruff format --check headroom/proxy/server.py tests/test_proxy_healthchecks.py tests/test_openai_codex_ws_lifecycle.py tests/test_proxy_loop_exception_health.py

4 files already formatted

Real Behavior Proof

  • Environment: Python proxy runtime with FastAPI TestClient, no external OpenAI credentials required.
  • Exact command / steps: invoke the installed loop exception handler with an asyncio context matching Connection.connection_lost plus AttributeError("'ClientConnection' object has no attribute 'recv_messages'"), then request /livez.
  • Observed result: the known websockets callback failure is recorded without delegating to the noisy default handler, /livez reports degraded loop callback health (HTTP 503, "status": "unhealthy", "alive": false), and unrelated callback exceptions still reach the delegated handler.
  • Not tested: the nondeterministic upstream CPython or websockets timing edge against a live network connection. The focused regression pins the callback shape reported in #1720, "[BUG] Proxy silently wedges (no crash, /livez hangs) after websockets connection_lost AttributeError".
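
The observed /livez result above can be read as a simple mapping from recorded failures to a response. The sketch below is illustrative only: `livez_response` and `loop_callback_failures` are hypothetical names, and the healthy payload shape is an assumption, since only the degraded payload (HTTP 503, "status": "unhealthy", "alive": false) is reported here.

```python
from collections import deque

# Hypothetical bounded record of loop callback failures (not the project's name).
loop_callback_failures = deque(maxlen=16)


def livez_response(failures):
    """Return (status_code, body) for a cheap, unauthenticated liveness probe."""
    if failures:
        # Matches the degraded result reported above: 503, unhealthy, alive false.
        return 503, {"status": "unhealthy", "alive": False}
    # Assumed healthy shape; the clean payload is not shown in this PR.
    return 200, {"status": "healthy", "alive": True}
```

Keeping the check to an in-memory lookup is one way to satisfy the requirement that the route stay cheap: it does no I/O and needs no credentials.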

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md if applicable

Additional Notes

CHANGELOG.md is release-managed from conventional commits, so this PR does not edit it manually. The scope stays inside the proxy runtime and Codex WebSocket dispatch path; it does not change compression, CCR, provider-neutral pipeline behavior, or generic transform modules.

github-actions Bot commented Jul 3, 2026

Contributor

PR governance

This PR follows the template and is marked ready for human review.

@github-actions github-actions Bot added the "status: ready for review" label Jul 3, 2026
@rodboev rodboev force-pushed the pr/1720-responses-ws-livez-wedge branch from 66decaf to 48baf78 July 3, 2026 05:00
@rodboev rodboev force-pushed the pr/1720-responses-ws-livez-wedge branch from 48baf78 to a884be0 July 3, 2026 06:02
@github-actions github-actions Bot added the "status: ci failing" label and removed the "status: ready for review" label Jul 3, 2026
@github-actions github-actions Bot added the "status: ready for review" label and removed the "status: ci failing" label Jul 3, 2026
@chopratejas chopratejas merged commit ceae879 into headroomlabs-ai:main Jul 3, 2026
27 checks passed

Labels

status: ready for review Pull request body is complete and the author marked it ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Proxy silently wedges (no crash, /livez hangs) after websockets connection_lost AttributeError

2 participants