Skip to content

Fix multi-minute tool-call hangs under heavy parallel load#535

Open
wonderwhy-er wants to merge 5 commits into
mainfrom
fix/tool-call-hang-parallel-load
Open

Fix multi-minute tool-call hangs under heavy parallel load#535
wonderwhy-er wants to merge 5 commits into
mainfrom
fix/tool-call-hang-parallel-load

Conversation

@wonderwhy-er

@wonderwhy-er wonderwhy-er commented Jul 2, 2026

Copy link
Copy Markdown
Owner

Problem

Under heavy parallel file I/O — e.g. several claude -p agents reading/writing files on a slow or cloud-synced filesystem — tool calls could hang for ~4 minutes and then work again on retry. Crucially, even pure in-memory tools like list_processes hung, which pointed past any single tool to a shared bottleneck.

Reported by a user running MCP in Claude Code with ~4–6 parallel subprocesses; reproduced locally.

Root cause

Three compounding issues:

  1. Every successful tool call gated its response on a disk write. The CallTool handler did await usageTracker.trackSuccess(name) before returning, which chains through configManager.setValuefs.writeFile. fs runs on libuv's threadpool (default 4 threads). When parallel reads stalled on a slow filesystem occupied all 4 threads — and stayed occupied — every tool's response was gated, including pure-memory ones.
  2. The default 4-thread pool is trivially exhausted by a handful of slow file operations.
  3. A JS-level timeout (withTimeout) did not cancel the underlying op, so a stalled read held its thread for the full OS duration, and read timeouts undercut / conflicted with the client's own cap.

Each call also fired an independent fs.writeFile of the same config file — a latent corruption risk under concurrency.

Fixes

  • Non-blocking, coalesced, serialized stats persistence. configManager.setValueNonBlocking() updates memory synchronously and schedules a background write on one serialized chain; a burst collapses to at most one queued write. usageTracker.saveStats uses it, so the response never waits on disk. Serialization also removes the concurrent-write corruption risk.
  • Raise UV_THREADPOOL_SIZE (default 16, respects a user-set value) via src/bootstrap.ts, imported first in index.ts — before any fs work is submitted, which is required for libuv to pick it up.
  • Cancellable, 3-minute read timeout. runWithAbortableTimeout() passes an AbortSignal into the read and aborts it on timeout, so a stalled read is cancelled and its fd/thread released instead of leaked. The timeout is a flat 3 minutes, chosen to sit below the MCP client's ~4-minute hard cap: we abort the fs op and return the useful cloud-storage error (buildPermissionError) before the client gives up with an opaque "No result received after 4 minutes". Signal is threaded through ReadOptions → text/image handlers; readFileInternal (the edit_block read, previously untimed) now uses it too. The read_file handler backstop is 3m30s — above the inner read, still below the client cap. Caveat: an in-kernel-wedged syscall only observes the abort once it returns, and library readers (Excel/PDF) ignore the signal — the threadpool headroom covers those.

Verification

Reproductions in test/repro/ (deterministic, kept out of CI):

repro before after
test-threadpool-starvation.js (pool=4, 4 stalled reads) trivial write blocks >5s pool=8 → 4ms
test-dc-tracking-gate.js — real trackSuccess() under starved pool >5000ms 1ms
test-bootstrap-threadpool.js — shipped bootstrap, no env override pool=16, no starvation
test-withtimeout-leak.js / test-read-abort-frees-thread.js timeout fires, read keeps running abort cancels the read (rejects, buffer never returned)

CI regression tests (fast, cross-platform):

  • test/test-nonblocking-config-save.js — non-blocking + coalesced + valid config.json under 100 concurrent writes. PASS (3/3).
  • test/test-read-abort-timeout.js — abort fires + ETIMEDOUT, timeout is 3m and below the client cap, normal read still works. PASS (4/4).

ab-test.test.js (17/0), test-file-handlers.js, test-line-count.js, test-edit-block-line-endings.js all pass; server module loads cleanly.

Note: test-edit-block-occurrences.js Test 3 fails, but it fails identically on main — a pre-existing issue unrelated to this PR.

Follow-ups (not in this PR)

  • interact_with_process re-join()s the whole (≤50MB) output buffer every 50ms poll (terminal-manager.ts), burning ~20% of the event loop for a single active session and scaling with concurrency; track a byte offset instead. Characterized in test/repro/test-interact-join-stall.js.
  • toolHistory flush uses sync appendFileSync/statSync on a 1s interval — a small periodic event-loop hitch, not on the response path.

Summary by CodeRabbit

  • New Features

    • File reads now support cancellation, helping prevent long-running operations from hanging the app.
    • Background config saves are now non-blocking, improving responsiveness during frequent updates.
    • Startup is more resilient, with earlier environment setup and safer handling of unexpected failures.
  • Bug Fixes

    • Improved protection against stalled filesystem work and threadpool starvation.
    • Increased read timeouts so large or slow file operations have more room to complete.
    • Added safer error handling during startup and config persistence.

Under heavy parallel file I/O (e.g. several `claude -p` agents reading and
writing files on a slow or cloud-synced filesystem), tool calls -- even
pure in-memory ones like list_processes -- could hang until the MCP
client's ~4-minute cap, then succeed again on retry.

Root cause (two compounding issues):
- Every successful tool call awaited usageTracker.trackSuccess() before
  returning, which chained through configManager.setValue -> fs.writeFile.
  fs runs on libuv's threadpool (default 4 threads). When parallel reads
  stalled on a slow filesystem occupied all threads -- and stayed occupied,
  since a JS-level timeout does not cancel the underlying syscall -- the
  awaited config write could not get a thread, gating the response of every
  tool, including pure-memory ones.
- The default 4-thread pool is trivially exhausted by a handful of slow
  file operations.

Fixes:
- Stats persist via configManager.setValueNonBlocking(): the in-memory
  update is synchronous, the disk write is coalesced and serialized on a
  single write chain, and the tool-call response never waits on it. The
  serialization also removes a pre-existing risk of overlapping writes
  corrupting config.json.
- Raise UV_THREADPOOL_SIZE (default 16, user-overridable) via a bootstrap
  module imported first in index.ts, before any fs work is submitted.

Tests:
- test/test-nonblocking-config-save.js: fast, cross-platform regression
  test (non-blocking + coalesced + no corruption under concurrent writes).
- test/repro/: FIFO/REPL-based reproductions used to diagnose the hang
  (kept out of the auto-discovered CI suite).
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de8af635-d010-4ca0-a067-94b3608dc981

📥 Commits

Reviewing files that changed from the base of the PR and between 7acfea1 and eef9932.

📒 Files selected for processing (1)
  • test/test-read-abort-timeout.js
👮 Files not reviewed due to content moderation or server errors (1)
  • test/test-read-abort-timeout.js

📝 Walkthrough
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: preventing long tool-call hangs under heavy parallel load.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/tool-call-hang-parallel-load

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/index.ts (2)

118-136: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Flushed deferred messages always sent at 'info' level, dropping real severity.

msg.level is captured on every deferLog() call (including 'error', 'warning', 'debug' at lines 69-73) but the flush loop hardcodes 'info' for all of them. A real config-load error would surface to the client as an info-level message.

🐛 Proposed fix
       while (deferredMessages.length > 0) {
         const msg = deferredMessages.shift()!;
-        transport.sendLog('info', msg.message);
+        transport.sendLog(msg.level, msg.message);
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/index.ts` around lines 118 - 136, The deferred log flush in
server.oninitialized is hardcoding every message to info, which discards the
original severity captured by deferLog. Update the flush loop in src/index.ts to
send each deferred message using its stored msg.level instead of a fixed info
level, and keep the existing deferredMessages/flushDeferredMessages flow intact
so errors, warnings, and debug logs preserve their severity when emitted.

78-111: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Broaden the JSON parse guard and extract it into one helper. Unexpected end of JSON input still falls through here, so truncated/empty payloads will hit process.exit(1), and the same check is duplicated in both handlers. Also avoid resuming after uncaughtException unless this path is guaranteed to leave the process in a clean state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/index.ts` around lines 78 - 111, The JSON parse handling in the
`process.on('uncaughtException')` and `process.on('unhandledRejection')`
handlers is too narrow and duplicated. Extract the shared JSON-error detection
into a single helper used by both handlers, and broaden it to catch
truncated/empty JSON failures such as “Unexpected end of JSON input” in addition
to the current unexpected-token case. Ensure the `logger.error` branch returns
consistently for all JSON parse errors, and review the `uncaughtException` path
so it does not continue execution unless that path is explicitly safe.
src/utils/usageTracker.ts (1)

361-364: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use the non-blocking config save path here too. saveOnboardingState and markFeedbackGiven still await configManager.setValue(...), and both are invoked inline on tool-call response paths, so they can reintroduce the same response hang under libuv threadpool saturation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/usageTracker.ts` around lines 361 - 364, The inline config write in
markFeedbackGiven is still blocking the tool-call response path by awaiting
configManager.setValue, which can hang under threadpool saturation. Update
markFeedbackGiven to use the same non-blocking config save path as
saveOnboardingState, and ensure the usageTracker flow no longer awaits
persistence before returning so the response path stays unblocked.
🧹 Nitpick comments (5)
test/repro/test-withtimeout-leak.js (1)

1-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider extracting shared repro-script boilerplate.

All five files under test/repro/ (test-bootstrap-threadpool.js, test-threadpool-starvation.js, test-env-threadpool-timing.js, test-dc-tracking-gate.js, and this file) duplicate the same fifo-creation, timestamped logger, and guard/process.exit pattern. A small shared helper module (e.g. test/repro/_helpers.js exporting makeFifo(), log(), withGuard()) would reduce copy-paste drift as more repros are added.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/repro/test-withtimeout-leak.js` around lines 1 - 40, The repro scripts
under test/repro duplicate fifo setup, timestamped logging, and the
timeout/guard exit flow, so extract that boilerplate into a shared helper to
prevent drift. Move the repeated pieces into a small reusable module (for
example helper functions used by the current script and the other repro files)
and update this file to use those shared utilities while keeping the existing
repro behavior in place.
src/config-manager.ts (1)

181-219: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win

Serialization looks correct; consider atomic writes for crash safety.

writeChain correctly prevents overlapping/interleaved writes to config.json (the corruption bug this PR targets). However, writeConfigToDisk still does a direct fs.writeFile overwrite — if the process is killed mid-write, the file can be left truncated/corrupt, independent of the concurrency fix. Since this function is newly introduced, it's a good opportunity to also write-to-temp-then-rename for atomicity.

♻️ Proposed atomic write
   private async writeConfigToDisk(): Promise<void> {
-    await fs.writeFile(this.configPath, JSON.stringify(this.config, null, 2), 'utf8');
+    const tmpPath = `${this.configPath}.tmp-${process.pid}`;
+    await fs.writeFile(tmpPath, JSON.stringify(this.config, null, 2), 'utf8');
+    await fs.rename(tmpPath, this.configPath);
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/config-manager.ts` around lines 181 - 219, The serialization fix in
writeConfigToDisk, saveConfig, and scheduleSave prevents overlapping writes, but
the direct fs.writeFile overwrite is still crash-prone. Update writeConfigToDisk
to use an atomic temp-file write followed by rename so a mid-write process crash
cannot leave config.json truncated or corrupt. Keep the existing writeChain
behavior unchanged and ensure both saveConfig and scheduleSave still call the
same atomic write path.
src/index.ts (1)

137-163: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Deferred startup logs are silently dropped on fatal error.

If runServer() throws before server.oninitialized fires (e.g. during server.connect()), any messages already buffered in deferredMessages (via deferLog) — including config-load diagnostics — are never emitted anywhere, since only logger.error/stdout notification are sent here. This removes exactly the diagnostic context needed to debug the failure.

♻️ Proposed fix
   } catch (error) {
     const errorMessage = error instanceof Error ? error.message : String(error);
     logger.error(`FATAL ERROR: ${errorMessage}`);
+    while (deferredMessages.length > 0) {
+      const msg = deferredMessages.shift()!;
+      logger.error(`[deferred] ${msg.level}: ${msg.message}`);
+    }
     if (error instanceof Error && error.stack) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/index.ts` around lines 137 - 163, Fatal startup failures in runServer()
can drop buffered diagnostics from deferredMessages because the catch path only
logs the error and sends a notification. Before process.exit(1), flush any
queued deferLog output from deferredMessages to the logger/stdout in the catch
block around server.connect(), so config-load and other deferred context is
preserved. Use the existing runServer, deferLog, and deferredMessages flow to
emit those messages once on fatal error, then continue with the current error
reporting and exit behavior.
test/test-nonblocking-config-save.js (1)

59-61: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

No watchdog timeout: if the fix regresses, this test hangs instead of failing.

run() has nothing bounding its total duration. If setValueNonBlocking ever regresses to blocking again (the exact bug this test guards against), Promise.all at line 35 would hang indefinitely and the test would never reach a pass/fail verdict — it would just hang the CI job rather than reporting a clear failure.

Consider racing run() against a timeout that exits with a clear failure message.

♻️ Proposed fix
-run()
-  .then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); })
-  .catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); });
+const timeout = new Promise((_, reject) =>
+  setTimeout(() => reject(new Error('test timed out after 10s (possible regression to blocking save)')), 10000)
+);
+
+Promise.race([run(), timeout])
+  .then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); })
+  .catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/test-nonblocking-config-save.js` around lines 59 - 61, The test harness
in run() has no overall timeout, so a regression in setValueNonBlocking or the
Promise.all path can hang CI instead of failing clearly. Update the test runner
around run() to race it against a watchdog timeout and force a clear failure
message if the timeout wins. Keep the change localized to the run().then/.catch
entrypoint and preserve the existing PASS/FAIL output for the normal path.
test/repro/test-interact-join-stall.js (1)

27-55: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Script never fails on stall — only logs a verdict, exits 0 regardless.

maxLag/sumLag are only used to print STALL REPRODUCED/no significant stall observed (Lines 51-53), and process.exit(0) runs unconditionally (Line 55). As written this is diagnostic-only and can't gate CI or catch regressions automatically. If this is meant to double as a regression guard (per the cohort's stated goal of validating the fix), consider setting a non-zero exit code when the stall thresholds are exceeded.

♻️ Proposed exit-code gating
-log(maxLag > 150 || sumLag > wall * 0.3
-  ? `STALL REPRODUCED: per-poll whole-buffer join starves the event loop`
-  : `no significant stall observed`);
+const stalled = maxLag > 150 || sumLag > wall * 0.3;
+log(stalled
+  ? `STALL REPRODUCED: per-poll whole-buffer join starves the event loop`
+  : `no significant stall observed`);
+process.exitCode = stalled ? 1 : 0;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/repro/test-interact-join-stall.js` around lines 27 - 55, The repro
script only prints a stall verdict and always exits successfully, so it cannot
fail CI when the regression reappears. Update the logic around the event-loop
monitor variables maxLag, sumLag, and the phase-2 verdict so that the script
sets a non-zero exit status when the stall thresholds are exceeded, while
keeping the existing diagnostic logs and cleanup via interactWithProcess intact.
Ensure the unconditional process.exit(0) is replaced with conditional
success/failure behavior based on the measured lag.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/repro/test-process-wait-client-cap.js`:
- Around line 50-53: The repro helper functions setCap, snapshotCap, and
restoreCap are mutating the shared singleton config, so isolate this test from
the user’s real config. Update the repro setup in
test-process-wait-client-cap.js to point configManager at a temp HOME/config
location or stub CONFIG_FILE before calling setCap/snapshotCap/restoreCap, and
ensure the isolated path is used throughout the repro lifecycle.

In `@test/test-nonblocking-config-save.js`:
- Around line 28-61: The failure path in run() can exit before pending
non-blocking writes finish and before the test key cleanup runs. Update the test
around configManager.setValueNonBlocking, scheduleSave, and the final cleanup so
the body uses try/finally: always remove __nonblockingSaveRegressionTest with
configManager.setValue(KEY, undefined), and allow any queued writeChain work to
settle before the process exits from the .catch() path.

---

Outside diff comments:
In `@src/index.ts`:
- Around line 118-136: The deferred log flush in server.oninitialized is
hardcoding every message to info, which discards the original severity captured
by deferLog. Update the flush loop in src/index.ts to send each deferred message
using its stored msg.level instead of a fixed info level, and keep the existing
deferredMessages/flushDeferredMessages flow intact so errors, warnings, and
debug logs preserve their severity when emitted.
- Around line 78-111: The JSON parse handling in the
`process.on('uncaughtException')` and `process.on('unhandledRejection')`
handlers is too narrow and duplicated. Extract the shared JSON-error detection
into a single helper used by both handlers, and broaden it to catch
truncated/empty JSON failures such as “Unexpected end of JSON input” in addition
to the current unexpected-token case. Ensure the `logger.error` branch returns
consistently for all JSON parse errors, and review the `uncaughtException` path
so it does not continue execution unless that path is explicitly safe.

In `@src/utils/usageTracker.ts`:
- Around line 361-364: The inline config write in markFeedbackGiven is still
blocking the tool-call response path by awaiting configManager.setValue, which
can hang under threadpool saturation. Update markFeedbackGiven to use the same
non-blocking config save path as saveOnboardingState, and ensure the
usageTracker flow no longer awaits persistence before returning so the response
path stays unblocked.

---

Nitpick comments:
In `@src/config-manager.ts`:
- Around line 181-219: The serialization fix in writeConfigToDisk, saveConfig,
and scheduleSave prevents overlapping writes, but the direct fs.writeFile
overwrite is still crash-prone. Update writeConfigToDisk to use an atomic
temp-file write followed by rename so a mid-write process crash cannot leave
config.json truncated or corrupt. Keep the existing writeChain behavior
unchanged and ensure both saveConfig and scheduleSave still call the same atomic
write path.

In `@src/index.ts`:
- Around line 137-163: Fatal startup failures in runServer() can drop buffered
diagnostics from deferredMessages because the catch path only logs the error and
sends a notification. Before process.exit(1), flush any queued deferLog output
from deferredMessages to the logger/stdout in the catch block around
server.connect(), so config-load and other deferred context is preserved. Use
the existing runServer, deferLog, and deferredMessages flow to emit those
messages once on fatal error, then continue with the current error reporting and
exit behavior.

In `@test/repro/test-interact-join-stall.js`:
- Around line 27-55: The repro script only prints a stall verdict and always
exits successfully, so it cannot fail CI when the regression reappears. Update
the logic around the event-loop monitor variables maxLag, sumLag, and the
phase-2 verdict so that the script sets a non-zero exit status when the stall
thresholds are exceeded, while keeping the existing diagnostic logs and cleanup
via interactWithProcess intact. Ensure the unconditional process.exit(0) is
replaced with conditional success/failure behavior based on the measured lag.

In `@test/repro/test-withtimeout-leak.js`:
- Around line 1-40: The repro scripts under test/repro duplicate fifo setup,
timestamped logging, and the timeout/guard exit flow, so extract that
boilerplate into a shared helper to prevent drift. Move the repeated pieces into
a small reusable module (for example helper functions used by the current script
and the other repro files) and update this file to use those shared utilities
while keeping the existing repro behavior in place.

In `@test/test-nonblocking-config-save.js`:
- Around line 59-61: The test harness in run() has no overall timeout, so a
regression in setValueNonBlocking or the Promise.all path can hang CI instead of
failing clearly. Update the test runner around run() to race it against a
watchdog timeout and force a clear failure message if the timeout wins. Keep the
change localized to the run().then/.catch entrypoint and preserve the existing
PASS/FAIL output for the normal path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 63037ede-608f-4e47-bf94-600caf85d826

📥 Commits

Reviewing files that changed from the base of the PR and between 0ad919b and 07ae5a3.

📒 Files selected for processing (13)
  • src/bootstrap.ts
  • src/config-manager.ts
  • src/index.ts
  • src/utils/usageTracker.ts
  • test/repro/test-bootstrap-threadpool.js
  • test/repro/test-dc-tracking-gate.js
  • test/repro/test-env-threadpool-timing.js
  • test/repro/test-interact-join-stall.js
  • test/repro/test-process-wait-client-cap.js
  • test/repro/test-start-process-timeout-block.js
  • test/repro/test-threadpool-starvation.js
  • test/repro/test-withtimeout-leak.js
  • test/test-nonblocking-config-save.js

Comment on lines +50 to +53
let ORIGINAL_CAP;
async function setCap(v) { await configManager.setValue(CAP_KEY, v); }
async function snapshotCap() { ORIGINAL_CAP = await configManager.getValue(CAP_KEY); }
async function restoreCap() { await configManager.setValue(CAP_KEY, ORIGINAL_CAP); }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect how configManager resolves its persisted config path
rg -n "CONFIG_PATH|configPath|homedir|appData|process\.env\." src/config-manager.ts

Repository: wonderwhy-er/DesktopCommanderMCP

Length of output: 664


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== src/config-manager.ts (selected lines) =="
cat -n src/config-manager.ts | sed -n '1,240p'

echo
echo "== test/repro/test-process-wait-client-cap.js (selected lines) =="
cat -n test/repro/test-process-wait-client-cap.js | sed -n '1,220p'

echo
echo "== locate CONFIG_FILE definition/usages =="
rg -n "CONFIG_FILE|configPath" src test dist

Repository: wonderwhy-er/DesktopCommanderMCP

Length of output: 22940


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== src/config.ts =="
cat -n src/config.ts

echo
echo "== search for config path overrides =="
rg -n "CONFIG_DIR|CONFIG_FILE|desktop-commander|process\.env\.(CONFIG|DESKTOP|HOME|XDG)|homedir\(" src test

Repository: wonderwhy-er/DesktopCommanderMCP

Length of output: 7810


Isolate this repro from the shared config file. setCap/snapshotCap/restoreCap mutate the singleton config at ~/.claude-server-commander/config.json, so this repro can leave a stray maxProcessWaitMs entry or overwrite user settings if it exits before restore. Use a temp HOME/config path (or stub CONFIG_FILE) for the repro.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/repro/test-process-wait-client-cap.js` around lines 50 - 53, The repro
helper functions setCap, snapshotCap, and restoreCap are mutating the shared
singleton config, so isolate this test from the user’s real config. Update the
repro setup in test-process-wait-client-cap.js to point configManager at a temp
HOME/config location or stub CONFIG_FILE before calling
setCap/snapshotCap/restoreCap, and ensure the isolated path is used throughout
the repro lifecycle.

Comment on lines +28 to +61
async function run() {
await configManager.getConfig(); // warm init so disk read is cached

// 1) A burst of non-blocking saves must resolve effectively immediately —
// they must not each wait on a disk write.
const BURST = 100;
const t0 = Date.now();
await Promise.all(
Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i))
);
const elapsed = Date.now() - t0;
assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`);
passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`);

// 2) The in-memory value is visible immediately (synchronously updated).
assert.strictEqual(await configManager.getValue(KEY), BURST - 1);
passed++; console.log('✓ in-memory value reflects the latest write immediately');

// 3) After the background flush window, config.json is valid JSON (no torn
// write from overlapping saves) and holds the final coalesced value.
await new Promise((r) => setTimeout(r, 300));
let parsed;
assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); },
'config.json must remain valid JSON after concurrent writes');
assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk');
passed++; console.log('✓ config.json is valid and holds the coalesced final value');

// cleanup: remove the test key (undefined is dropped by JSON.stringify)
await configManager.setValue(KEY, undefined);
}

run()
.then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); })
.catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); });

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Failure path can corrupt the real config.json and leaks the test key.

If any assertion in run() throws (lines 39, 43, 51, or 52), execution jumps straight to the .catch() at line 61, which calls process.exit(1) immediately. At that point:

  • The background write scheduled by the burst's setValueNonBlocking calls (scheduleSave() in src/config-manager.ts) may still be in flight on the writeChain. process.exit() can abort pending async I/O, risking a truncated/torn write to the user's real config.json — the exact corruption scenario this PR's serialized writeChain is meant to prevent.
  • Cleanup at line 56 (setValue(KEY, undefined)) never runs, so __nonblockingSaveRegressionTest is permanently left in the user's actual config file (this test operates directly on CONFIG_FILE, the real ~/.claude-server-commander/config.json, not an isolated fixture).

Wrap the test body in try/finally so cleanup always runs and pending writes are given a chance to settle before exiting.

♻️ Proposed fix
 async function run() {
   await configManager.getConfig(); // warm init so disk read is cached
 
-  // 1) A burst of non-blocking saves must resolve effectively immediately —
-  //    they must not each wait on a disk write.
-  const BURST = 100;
-  const t0 = Date.now();
-  await Promise.all(
-    Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i))
-  );
-  const elapsed = Date.now() - t0;
-  assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`);
-  passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`);
-
-  // 2) The in-memory value is visible immediately (synchronously updated).
-  assert.strictEqual(await configManager.getValue(KEY), BURST - 1);
-  passed++; console.log('✓ in-memory value reflects the latest write immediately');
-
-  // 3) After the background flush window, config.json is valid JSON (no torn
-  //    write from overlapping saves) and holds the final coalesced value.
-  await new Promise((r) => setTimeout(r, 300));
-  let parsed;
-  assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); },
-    'config.json must remain valid JSON after concurrent writes');
-  assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk');
-  passed++; console.log('✓ config.json is valid and holds the coalesced final value');
-
-  // cleanup: remove the test key (undefined is dropped by JSON.stringify)
-  await configManager.setValue(KEY, undefined);
+  const BURST = 100;
+  try {
+    // 1) A burst of non-blocking saves must resolve effectively immediately —
+    //    they must not each wait on a disk write.
+    const t0 = Date.now();
+    await Promise.all(
+      Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i))
+    );
+    const elapsed = Date.now() - t0;
+    assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`);
+    passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`);
+
+    // 2) The in-memory value is visible immediately (synchronously updated).
+    assert.strictEqual(await configManager.getValue(KEY), BURST - 1);
+    passed++; console.log('✓ in-memory value reflects the latest write immediately');
+
+    // 3) After the background flush window, config.json is valid JSON (no torn
+    //    write from overlapping saves) and holds the final coalesced value.
+    await new Promise((r) => setTimeout(r, 300));
+    let parsed;
+    assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); },
+      'config.json must remain valid JSON after concurrent writes');
+    assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk');
+    passed++; console.log('✓ config.json is valid and holds the coalesced final value');
+  } finally {
+    // cleanup: always remove the test key, even on assertion failure, and
+    // await it so no torn write to the real config.json is left in flight.
+    await configManager.setValue(KEY, undefined).catch(() => {});
+  }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async function run() {
await configManager.getConfig(); // warm init so disk read is cached
// 1) A burst of non-blocking saves must resolve effectively immediately —
// they must not each wait on a disk write.
const BURST = 100;
const t0 = Date.now();
await Promise.all(
Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i))
);
const elapsed = Date.now() - t0;
assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`);
passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`);
// 2) The in-memory value is visible immediately (synchronously updated).
assert.strictEqual(await configManager.getValue(KEY), BURST - 1);
passed++; console.log('✓ in-memory value reflects the latest write immediately');
// 3) After the background flush window, config.json is valid JSON (no torn
// write from overlapping saves) and holds the final coalesced value.
await new Promise((r) => setTimeout(r, 300));
let parsed;
assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); },
'config.json must remain valid JSON after concurrent writes');
assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk');
passed++; console.log('✓ config.json is valid and holds the coalesced final value');
// cleanup: remove the test key (undefined is dropped by JSON.stringify)
await configManager.setValue(KEY, undefined);
}
run()
.then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); })
.catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); });
async function run() {
await configManager.getConfig(); // warm init so disk read is cached
const BURST = 100;
try {
// 1) A burst of non-blocking saves must resolve effectively immediately —
// they must not each wait on a disk write.
const t0 = Date.now();
await Promise.all(
Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i))
);
const elapsed = Date.now() - t0;
assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`);
passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`);
// 2) The in-memory value is visible immediately (synchronously updated).
assert.strictEqual(await configManager.getValue(KEY), BURST - 1);
passed++; console.log('✓ in-memory value reflects the latest write immediately');
// 3) After the background flush window, config.json is valid JSON (no torn
// write from overlapping saves) and holds the final coalesced value.
await new Promise((r) => setTimeout(r, 300));
let parsed;
assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); },
'config.json must remain valid JSON after concurrent writes');
assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk');
passed++; console.log('✓ config.json is valid and holds the coalesced final value');
} finally {
// cleanup: always remove the test key, even on assertion failure, and
// await it so no torn write to the real config.json is left in flight.
await configManager.setValue(KEY, undefined).catch(() => {});
}
}
run()
.then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); })
.catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); });
🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 47-47: Avoid using the initial state variable in setState
Context: setTimeout(r, 300)
Note: [CWE-710] Improper Adherence to Coding Standards. Security best practice.

(setstate-same-var)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/test-nonblocking-config-save.js` around lines 28 - 61, The failure path
in run() can exit before pending non-blocking writes finish and before the test
key cleanup runs. Update the test around configManager.setValueNonBlocking,
scheduleSave, and the final cleanup so the body uses try/finally: always remove
__nonblockingSaveRegressionTest with configManager.setValue(KEY, undefined), and
allow any queued writeChain work to settle before the process exits from the
.catch() path.

Follow-up to the parallel-load hang fix. withTimeout() rejected on a timer
but left the underlying fs op running, so a stalled read kept holding its
libuv thread until the OS call returned. And the read timeout was a fixed
30s, which conflates "stalled" with "large file on slow media".

Changes:
- runWithAbortableTimeout(fn(signal), ms, name): passes an AbortSignal into
  the operation and aborts it on timeout, so signal-honoring reads are
  cancelled and their fd/thread released. Rejects with code ETIMEDOUT.
- Read timeout is now size-aware: FLOOR (30s stall detector for small files)
  + size / assumed-throughput (>=4 MB/s) capped at 10 min. A 1GB file gets
  ~4.8 min instead of dying at 30s; a wedged tiny file still fails fast.
- Thread AbortSignal through the read path: ReadOptions.signal ->
  TextFileHandler (createReadStream/readFile/fd loops) and ImageFileHandler.
- readFileInternal (edit_block's whole-file read) had NO timeout at all; it
  now uses the same size-aware, cancellable read.
- Raise the read_file handler backstop (60s -> 15min) so it sits above the
  size-aware inner cap and never cuts a large-file read short.

Caveat (documented in code): a syscall wedged in-kernel on a cold cloud file
only observes the abort once it returns; library readers (Excel/PDF) ignore
the signal and still get the timeout but keep running in the background. The
threadpool headroom from the previous commit covers those cases.

Tests:
- test/test-read-abort-timeout.js (CI): abort fires + ETIMEDOUT, size-aware
  timeout floor/scale/cap, normal read still works. 4/4.
- test/repro/test-read-abort-frees-thread.js: proves the signal reaches fs
  and cancels a real read (rejects instead of returning the buffer).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/tools/filesystem.ts (1)

504-522: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Timebox the pre-read fs.stat calls — This repo targets Node >=18, and fsPromises.stat() still doesn’t accept signal there, so both the directory check and size probe can block before the abortable read starts on slow mounts. Wrap each stat in a bounded timeout/race so callers don’t wait on the full mount latency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tools/filesystem.ts` around lines 504 - 522, The pre-read fs.stat probes
in the filesystem read flow can still block on slow mounts before the abortable
read begins. Update the stat checks in the validation and size-probe path of the
file read logic in filesystem.ts to use a bounded timeout/race so they fail fast
instead of waiting on the full mount latency. Keep the existing read behavior
and telemetry, but ensure the stat calls in the relevant helper flow are
time-limited.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/tools/filesystem.ts`:
- Around line 504-522: The pre-read fs.stat probes in the filesystem read flow
can still block on slow mounts before the abortable read begins. Update the stat
checks in the validation and size-probe path of the file read logic in
filesystem.ts to use a bounded timeout/race so they fail fast instead of waiting
on the full mount latency. Keep the existing read behavior and telemetry, but
ensure the stat calls in the relevant helper flow are time-limited.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: db27ffff-72d2-4f6e-8374-8f34c05f17f6

📥 Commits

Reviewing files that changed from the base of the PR and between 07ae5a3 and 09a26e3.

📒 Files selected for processing (8)
  • src/handlers/filesystem-handlers.ts
  • src/tools/filesystem.ts
  • src/utils/files/base.ts
  • src/utils/files/image.ts
  • src/utils/files/text.ts
  • src/utils/withTimeout.ts
  • test/repro/test-read-abort-frees-thread.js
  • test/test-read-abort-timeout.js

Per review: replace the size-aware read timeout with a flat 3 minutes.
The MCP client hard-caps a tool call at ~4 minutes, so timing out at 3m
lets us abort the fs op (via runWithAbortableTimeout) and return the
useful cloud-storage error before the client reports an opaque timeout.
The AbortController cancellation is unchanged; only the duration/shape of
the budget changed. readFileFromDisk and readFileInternal both use it;
the read_file handler backstop is 3m30s (above the inner read, below the
client cap). computeReadTimeoutMs removed.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/tools/filesystem.ts (2)

542-559: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Stale dead-code branch: isWithTimeoutString no longer applies to this call site.

This catch block still checks for the old withTimeout plain-string rejection format, but readOperation is now run via runWithAbortableTimeout (Lines 545-549), which per its contract always rejects with a proper Error carrying .code = 'ETIMEDOUT' — never a string. isWithTimeoutString will always be false here now, so the comment (Lines 552-554) describing the string-rejection case is misleading for this specific call site and the branch is effectively dead.

♻️ Suggested cleanup
     } catch (error) {
         const err = error as NodeJS.ErrnoException;
-        // withTimeout rejects with a plain string "__ERROR__: ... timed out after N seconds"
-        // when defaultValue is null — it has no .code property, so check for that too.
-        const isWithTimeoutString = typeof error === 'string' && (error as string).startsWith('__ERROR__:');
-        if (isWithTimeoutString || err.code === 'EPERM' || err.code === 'EACCES' || err.code === 'ETIMEDOUT') {
-            throw buildPermissionError(filePath, isWithTimeoutString ? 'ETIMEDOUT' : err.code);
+        if (err.code === 'EPERM' || err.code === 'EACCES' || err.code === 'ETIMEDOUT') {
+            throw buildPermissionError(filePath, err.code);
         }
         throw error;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tools/filesystem.ts` around lines 542 - 559, In the read path handled by
`runWithAbortableTimeout` inside `readOperation`, remove the stale
`isWithTimeoutString` handling from the catch block because this helper now
always rejects with an `Error` and `.code = 'ETIMEDOUT'`. Update the
`buildPermissionError` call and surrounding comments in `filesystem.ts` to
reflect the current contract, and keep only the error-code checks that can
actually occur here.

23-29: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Tighten the timeout comments and remove the stale string fallback

  • src/tools/filesystem.ts:23-29, 506-508, 614-621 should avoid saying the fd/thread is freed immediately; runWithAbortableTimeout fails the call fast, but a blocked syscall can still keep running underneath.
  • src/tools/filesystem.ts:552-556 still has leftover withTimeout string handling here; this path now raises ETIMEDOUT errors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tools/filesystem.ts` around lines 23 - 29, Update the timeout
documentation around READ_OPERATION_TIMEOUT_MS and the related filesystem read
path so it no longer claims the fd/thread is released immediately; describe it
as failing fast via runWithAbortableTimeout while the underlying syscall may
still continue briefly. Also remove the stale withTimeout string fallback in the
read/error handling path that now raises ETIMEDOUT, and update the affected
logic in the filesystem read helper to only handle the current error shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/tools/filesystem.ts`:
- Around line 542-559: In the read path handled by `runWithAbortableTimeout`
inside `readOperation`, remove the stale `isWithTimeoutString` handling from the
catch block because this helper now always rejects with an `Error` and `.code =
'ETIMEDOUT'`. Update the `buildPermissionError` call and surrounding comments in
`filesystem.ts` to reflect the current contract, and keep only the error-code
checks that can actually occur here.
- Around line 23-29: Update the timeout documentation around
READ_OPERATION_TIMEOUT_MS and the related filesystem read path so it no longer
claims the fd/thread is released immediately; describe it as failing fast via
runWithAbortableTimeout while the underlying syscall may still continue briefly.
Also remove the stale withTimeout string fallback in the read/error handling
path that now raises ETIMEDOUT, and update the affected logic in the filesystem
read helper to only handle the current error shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cb13c4e9-b9fa-4f1d-83b2-acf3fa66c6ff

📥 Commits

Reviewing files that changed from the base of the PR and between 09a26e3 and e7ad435.

📒 Files selected for processing (3)
  • src/handlers/filesystem-handlers.ts
  • src/tools/filesystem.ts
  • test/test-read-abort-timeout.js

readFileFromDisk now times out via runWithAbortableTimeout, which rejects
with a coded Error (.code === 'ETIMEDOUT') rather than the legacy
'__ERROR__:' string that withTimeout(..., null) produced. The string check
in the catch was therefore unreachable; the existing ETIMEDOUT branch
already handles it.
The integration case read its own path via readFile, which enforces
allowedDirectories from the on-disk config. When another suite test had
left allowedDirectories pointing elsewhere (e.g. test/test_output), the
self-read was rejected and the test failed depending on ambient config.

It now creates its own temp dir, sets allowedDirectories to it (realpath'd
to avoid the macOS /tmp symlink mismatch), reads a temp file, and restores
the original config in a finally. Also dropped the now-unused __dirname/
fileURLToPath.

@edgarsskore edgarsskore left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants