Fix multi-minute tool-call hangs under heavy parallel load#535
Fix multi-minute tool-call hangs under heavy parallel load#535wonderwhy-er wants to merge 5 commits into
Conversation
Under heavy parallel file I/O (e.g. several `claude -p` agents reading and writing files on a slow or cloud-synced filesystem), tool calls -- even pure in-memory ones like list_processes -- could hang until the MCP client's ~4-minute cap, then succeed again on retry. Root cause (two compounding issues): - Every successful tool call awaited usageTracker.trackSuccess() before returning, which chained through configManager.setValue -> fs.writeFile. fs runs on libuv's threadpool (default 4 threads). When parallel reads stalled on a slow filesystem occupied all threads -- and stayed occupied, since a JS-level timeout does not cancel the underlying syscall -- the awaited config write could not get a thread, gating the response of every tool, including pure-memory ones. - The default 4-thread pool is trivially exhausted by a handful of slow file operations. Fixes: - Stats persist via configManager.setValueNonBlocking(): the in-memory update is synchronous, the disk write is coalesced and serialized on a single write chain, and the tool-call response never waits on it. The serialization also removes a pre-existing risk of overlapping writes corrupting config.json. - Raise UV_THREADPOOL_SIZE (default 16, user-overridable) via a bootstrap module imported first in index.ts, before any fs work is submitted. Tests: - test/test-nonblocking-config-save.js: fast, cross-platform regression test (non-blocking + coalesced + no corruption under concurrent writes). - test/repro/: FIFO/REPL-based reproductions used to diagnose the hang (kept out of the auto-discovered CI suite).
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
👮 Files not reviewed due to content moderation or server errors (1)
📝 Walkthrough🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/index.ts (2)
118-136: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winFlushed deferred messages always sent at 'info' level, dropping real severity.
msg.levelis captured on everydeferLog()call (including'error','warning','debug'at lines 69-73) but the flush loop hardcodes'info'for all of them. A real config-load error would surface to the client as an info-level message.🐛 Proposed fix
while (deferredMessages.length > 0) { const msg = deferredMessages.shift()!; - transport.sendLog('info', msg.message); + transport.sendLog(msg.level, msg.message); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/index.ts` around lines 118 - 136, The deferred log flush in server.oninitialized is hardcoding every message to info, which discards the original severity captured by deferLog. Update the flush loop in src/index.ts to send each deferred message using its stored msg.level instead of a fixed info level, and keep the existing deferredMessages/flushDeferredMessages flow intact so errors, warnings, and debug logs preserve their severity when emitted.
78-111: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winBroaden the JSON parse guard and extract it into one helper.
Unexpected end of JSON inputstill falls through here, so truncated/empty payloads will hitprocess.exit(1), and the same check is duplicated in both handlers. Also avoid resuming afteruncaughtExceptionunless this path is guaranteed to leave the process in a clean state.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/index.ts` around lines 78 - 111, The JSON parse handling in the `process.on('uncaughtException')` and `process.on('unhandledRejection')` handlers is too narrow and duplicated. Extract the shared JSON-error detection into a single helper used by both handlers, and broaden it to catch truncated/empty JSON failures such as “Unexpected end of JSON input” in addition to the current unexpected-token case. Ensure the `logger.error` branch returns consistently for all JSON parse errors, and review the `uncaughtException` path so it does not continue execution unless that path is explicitly safe.src/utils/usageTracker.ts (1)
361-364: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winUse the non-blocking config save path here too.
saveOnboardingStateandmarkFeedbackGivenstill awaitconfigManager.setValue(...), and both are invoked inline on tool-call response paths, so they can reintroduce the same response hang under libuv threadpool saturation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/utils/usageTracker.ts` around lines 361 - 364, The inline config write in markFeedbackGiven is still blocking the tool-call response path by awaiting configManager.setValue, which can hang under threadpool saturation. Update markFeedbackGiven to use the same non-blocking config save path as saveOnboardingState, and ensure the usageTracker flow no longer awaits persistence before returning so the response path stays unblocked.
🧹 Nitpick comments (5)
test/repro/test-withtimeout-leak.js (1)
1-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider extracting shared repro-script boilerplate.
All five files under
test/repro/(test-bootstrap-threadpool.js, test-threadpool-starvation.js, test-env-threadpool-timing.js, test-dc-tracking-gate.js, and this file) duplicate the same fifo-creation, timestamped logger, and guard/process.exitpattern. A small shared helper module (e.g.test/repro/_helpers.jsexportingmakeFifo(),log(),withGuard()) would reduce copy-paste drift as more repros are added.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/repro/test-withtimeout-leak.js` around lines 1 - 40, The repro scripts under test/repro duplicate fifo setup, timestamped logging, and the timeout/guard exit flow, so extract that boilerplate into a shared helper to prevent drift. Move the repeated pieces into a small reusable module (for example helper functions used by the current script and the other repro files) and update this file to use those shared utilities while keeping the existing repro behavior in place.src/config-manager.ts (1)
181-219: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick winSerialization looks correct; consider atomic writes for crash safety.
writeChaincorrectly prevents overlapping/interleaved writes toconfig.json(the corruption bug this PR targets). However,writeConfigToDiskstill does a directfs.writeFileoverwrite — if the process is killed mid-write, the file can be left truncated/corrupt, independent of the concurrency fix. Since this function is newly introduced, it's a good opportunity to also write-to-temp-then-rename for atomicity.♻️ Proposed atomic write
private async writeConfigToDisk(): Promise<void> { - await fs.writeFile(this.configPath, JSON.stringify(this.config, null, 2), 'utf8'); + const tmpPath = `${this.configPath}.tmp-${process.pid}`; + await fs.writeFile(tmpPath, JSON.stringify(this.config, null, 2), 'utf8'); + await fs.rename(tmpPath, this.configPath); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/config-manager.ts` around lines 181 - 219, The serialization fix in writeConfigToDisk, saveConfig, and scheduleSave prevents overlapping writes, but the direct fs.writeFile overwrite is still crash-prone. Update writeConfigToDisk to use an atomic temp-file write followed by rename so a mid-write process crash cannot leave config.json truncated or corrupt. Keep the existing writeChain behavior unchanged and ensure both saveConfig and scheduleSave still call the same atomic write path.src/index.ts (1)
137-163: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winDeferred startup logs are silently dropped on fatal error.
If
runServer()throws beforeserver.oninitializedfires (e.g. duringserver.connect()), any messages already buffered indeferredMessages(viadeferLog) — including config-load diagnostics — are never emitted anywhere, since onlylogger.error/stdout notification are sent here. This removes exactly the diagnostic context needed to debug the failure.♻️ Proposed fix
} catch (error) { const errorMessage = error instanceof Error ? error.message : String(error); logger.error(`FATAL ERROR: ${errorMessage}`); + while (deferredMessages.length > 0) { + const msg = deferredMessages.shift()!; + logger.error(`[deferred] ${msg.level}: ${msg.message}`); + } if (error instanceof Error && error.stack) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/index.ts` around lines 137 - 163, Fatal startup failures in runServer() can drop buffered diagnostics from deferredMessages because the catch path only logs the error and sends a notification. Before process.exit(1), flush any queued deferLog output from deferredMessages to the logger/stdout in the catch block around server.connect(), so config-load and other deferred context is preserved. Use the existing runServer, deferLog, and deferredMessages flow to emit those messages once on fatal error, then continue with the current error reporting and exit behavior.test/test-nonblocking-config-save.js (1)
59-61: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winNo watchdog timeout: if the fix regresses, this test hangs instead of failing.
run()has nothing bounding its total duration. IfsetValueNonBlockingever regresses to blocking again (the exact bug this test guards against),Promise.allat line 35 would hang indefinitely and the test would never reach apass/failverdict — it would just hang the CI job rather than reporting a clear failure.Consider racing
run()against a timeout that exits with a clear failure message.♻️ Proposed fix
-run() - .then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); }) - .catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); }); +const timeout = new Promise((_, reject) => + setTimeout(() => reject(new Error('test timed out after 10s (possible regression to blocking save)')), 10000) +); + +Promise.race([run(), timeout]) + .then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); }) + .catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/test-nonblocking-config-save.js` around lines 59 - 61, The test harness in run() has no overall timeout, so a regression in setValueNonBlocking or the Promise.all path can hang CI instead of failing clearly. Update the test runner around run() to race it against a watchdog timeout and force a clear failure message if the timeout wins. Keep the change localized to the run().then/.catch entrypoint and preserve the existing PASS/FAIL output for the normal path.test/repro/test-interact-join-stall.js (1)
27-55: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winScript never fails on stall — only logs a verdict, exits 0 regardless.
maxLag/sumLagare only used to printSTALL REPRODUCED/no significant stall observed(Lines 51-53), andprocess.exit(0)runs unconditionally (Line 55). As written this is diagnostic-only and can't gate CI or catch regressions automatically. If this is meant to double as a regression guard (per the cohort's stated goal of validating the fix), consider setting a non-zero exit code when the stall thresholds are exceeded.♻️ Proposed exit-code gating
-log(maxLag > 150 || sumLag > wall * 0.3 - ? `STALL REPRODUCED: per-poll whole-buffer join starves the event loop` - : `no significant stall observed`); +const stalled = maxLag > 150 || sumLag > wall * 0.3; +log(stalled + ? `STALL REPRODUCED: per-poll whole-buffer join starves the event loop` + : `no significant stall observed`); +process.exitCode = stalled ? 1 : 0;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/repro/test-interact-join-stall.js` around lines 27 - 55, The repro script only prints a stall verdict and always exits successfully, so it cannot fail CI when the regression reappears. Update the logic around the event-loop monitor variables maxLag, sumLag, and the phase-2 verdict so that the script sets a non-zero exit status when the stall thresholds are exceeded, while keeping the existing diagnostic logs and cleanup via interactWithProcess intact. Ensure the unconditional process.exit(0) is replaced with conditional success/failure behavior based on the measured lag.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/repro/test-process-wait-client-cap.js`:
- Around line 50-53: The repro helper functions setCap, snapshotCap, and
restoreCap are mutating the shared singleton config, so isolate this test from
the user’s real config. Update the repro setup in
test-process-wait-client-cap.js to point configManager at a temp HOME/config
location or stub CONFIG_FILE before calling setCap/snapshotCap/restoreCap, and
ensure the isolated path is used throughout the repro lifecycle.
In `@test/test-nonblocking-config-save.js`:
- Around line 28-61: The failure path in run() can exit before pending
non-blocking writes finish and before the test key cleanup runs. Update the test
around configManager.setValueNonBlocking, scheduleSave, and the final cleanup so
the body uses try/finally: always remove __nonblockingSaveRegressionTest with
configManager.setValue(KEY, undefined), and allow any queued writeChain work to
settle before the process exits from the .catch() path.
---
Outside diff comments:
In `@src/index.ts`:
- Around line 118-136: The deferred log flush in server.oninitialized is
hardcoding every message to info, which discards the original severity captured
by deferLog. Update the flush loop in src/index.ts to send each deferred message
using its stored msg.level instead of a fixed info level, and keep the existing
deferredMessages/flushDeferredMessages flow intact so errors, warnings, and
debug logs preserve their severity when emitted.
- Around line 78-111: The JSON parse handling in the
`process.on('uncaughtException')` and `process.on('unhandledRejection')`
handlers is too narrow and duplicated. Extract the shared JSON-error detection
into a single helper used by both handlers, and broaden it to catch
truncated/empty JSON failures such as “Unexpected end of JSON input” in addition
to the current unexpected-token case. Ensure the `logger.error` branch returns
consistently for all JSON parse errors, and review the `uncaughtException` path
so it does not continue execution unless that path is explicitly safe.
In `@src/utils/usageTracker.ts`:
- Around line 361-364: The inline config write in markFeedbackGiven is still
blocking the tool-call response path by awaiting configManager.setValue, which
can hang under threadpool saturation. Update markFeedbackGiven to use the same
non-blocking config save path as saveOnboardingState, and ensure the
usageTracker flow no longer awaits persistence before returning so the response
path stays unblocked.
---
Nitpick comments:
In `@src/config-manager.ts`:
- Around line 181-219: The serialization fix in writeConfigToDisk, saveConfig,
and scheduleSave prevents overlapping writes, but the direct fs.writeFile
overwrite is still crash-prone. Update writeConfigToDisk to use an atomic
temp-file write followed by rename so a mid-write process crash cannot leave
config.json truncated or corrupt. Keep the existing writeChain behavior
unchanged and ensure both saveConfig and scheduleSave still call the same atomic
write path.
In `@src/index.ts`:
- Around line 137-163: Fatal startup failures in runServer() can drop buffered
diagnostics from deferredMessages because the catch path only logs the error and
sends a notification. Before process.exit(1), flush any queued deferLog output
from deferredMessages to the logger/stdout in the catch block around
server.connect(), so config-load and other deferred context is preserved. Use
the existing runServer, deferLog, and deferredMessages flow to emit those
messages once on fatal error, then continue with the current error reporting and
exit behavior.
In `@test/repro/test-interact-join-stall.js`:
- Around line 27-55: The repro script only prints a stall verdict and always
exits successfully, so it cannot fail CI when the regression reappears. Update
the logic around the event-loop monitor variables maxLag, sumLag, and the
phase-2 verdict so that the script sets a non-zero exit status when the stall
thresholds are exceeded, while keeping the existing diagnostic logs and cleanup
via interactWithProcess intact. Ensure the unconditional process.exit(0) is
replaced with conditional success/failure behavior based on the measured lag.
In `@test/repro/test-withtimeout-leak.js`:
- Around line 1-40: The repro scripts under test/repro duplicate fifo setup,
timestamped logging, and the timeout/guard exit flow, so extract that
boilerplate into a shared helper to prevent drift. Move the repeated pieces into
a small reusable module (for example helper functions used by the current script
and the other repro files) and update this file to use those shared utilities
while keeping the existing repro behavior in place.
In `@test/test-nonblocking-config-save.js`:
- Around line 59-61: The test harness in run() has no overall timeout, so a
regression in setValueNonBlocking or the Promise.all path can hang CI instead of
failing clearly. Update the test runner around run() to race it against a
watchdog timeout and force a clear failure message if the timeout wins. Keep the
change localized to the run().then/.catch entrypoint and preserve the existing
PASS/FAIL output for the normal path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 63037ede-608f-4e47-bf94-600caf85d826
📒 Files selected for processing (13)
src/bootstrap.tssrc/config-manager.tssrc/index.tssrc/utils/usageTracker.tstest/repro/test-bootstrap-threadpool.jstest/repro/test-dc-tracking-gate.jstest/repro/test-env-threadpool-timing.jstest/repro/test-interact-join-stall.jstest/repro/test-process-wait-client-cap.jstest/repro/test-start-process-timeout-block.jstest/repro/test-threadpool-starvation.jstest/repro/test-withtimeout-leak.jstest/test-nonblocking-config-save.js
| let ORIGINAL_CAP; | ||
| async function setCap(v) { await configManager.setValue(CAP_KEY, v); } | ||
| async function snapshotCap() { ORIGINAL_CAP = await configManager.getValue(CAP_KEY); } | ||
| async function restoreCap() { await configManager.setValue(CAP_KEY, ORIGINAL_CAP); } |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect how configManager resolves its persisted config path
rg -n "CONFIG_PATH|configPath|homedir|appData|process\.env\." src/config-manager.tsRepository: wonderwhy-er/DesktopCommanderMCP
Length of output: 664
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== src/config-manager.ts (selected lines) =="
cat -n src/config-manager.ts | sed -n '1,240p'
echo
echo "== test/repro/test-process-wait-client-cap.js (selected lines) =="
cat -n test/repro/test-process-wait-client-cap.js | sed -n '1,220p'
echo
echo "== locate CONFIG_FILE definition/usages =="
rg -n "CONFIG_FILE|configPath" src test distRepository: wonderwhy-er/DesktopCommanderMCP
Length of output: 22940
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== src/config.ts =="
cat -n src/config.ts
echo
echo "== search for config path overrides =="
rg -n "CONFIG_DIR|CONFIG_FILE|desktop-commander|process\.env\.(CONFIG|DESKTOP|HOME|XDG)|homedir\(" src testRepository: wonderwhy-er/DesktopCommanderMCP
Length of output: 7810
Isolate this repro from the shared config file. setCap/snapshotCap/restoreCap mutate the singleton config at ~/.claude-server-commander/config.json, so this repro can leave a stray maxProcessWaitMs entry or overwrite user settings if it exits before restore. Use a temp HOME/config path (or stub CONFIG_FILE) for the repro.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/repro/test-process-wait-client-cap.js` around lines 50 - 53, The repro
helper functions setCap, snapshotCap, and restoreCap are mutating the shared
singleton config, so isolate this test from the user’s real config. Update the
repro setup in test-process-wait-client-cap.js to point configManager at a temp
HOME/config location or stub CONFIG_FILE before calling
setCap/snapshotCap/restoreCap, and ensure the isolated path is used throughout
the repro lifecycle.
| async function run() { | ||
| await configManager.getConfig(); // warm init so disk read is cached | ||
|
|
||
| // 1) A burst of non-blocking saves must resolve effectively immediately — | ||
| // they must not each wait on a disk write. | ||
| const BURST = 100; | ||
| const t0 = Date.now(); | ||
| await Promise.all( | ||
| Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i)) | ||
| ); | ||
| const elapsed = Date.now() - t0; | ||
| assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`); | ||
| passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`); | ||
|
|
||
| // 2) The in-memory value is visible immediately (synchronously updated). | ||
| assert.strictEqual(await configManager.getValue(KEY), BURST - 1); | ||
| passed++; console.log('✓ in-memory value reflects the latest write immediately'); | ||
|
|
||
| // 3) After the background flush window, config.json is valid JSON (no torn | ||
| // write from overlapping saves) and holds the final coalesced value. | ||
| await new Promise((r) => setTimeout(r, 300)); | ||
| let parsed; | ||
| assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); }, | ||
| 'config.json must remain valid JSON after concurrent writes'); | ||
| assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk'); | ||
| passed++; console.log('✓ config.json is valid and holds the coalesced final value'); | ||
|
|
||
| // cleanup: remove the test key (undefined is dropped by JSON.stringify) | ||
| await configManager.setValue(KEY, undefined); | ||
| } | ||
|
|
||
| run() | ||
| .then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); }) | ||
| .catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); }); |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Failure path can corrupt the real config.json and leaks the test key.
If any assertion in run() throws (lines 39, 43, 51, or 52), execution jumps straight to the .catch() at line 61, which calls process.exit(1) immediately. At that point:
- The background write scheduled by the burst's
setValueNonBlockingcalls (scheduleSave()insrc/config-manager.ts) may still be in flight on the writeChain.process.exit()can abort pending async I/O, risking a truncated/torn write to the user's realconfig.json— the exact corruption scenario this PR's serialized writeChain is meant to prevent. - Cleanup at line 56 (
setValue(KEY, undefined)) never runs, so__nonblockingSaveRegressionTestis permanently left in the user's actual config file (this test operates directly onCONFIG_FILE, the real~/.claude-server-commander/config.json, not an isolated fixture).
Wrap the test body in try/finally so cleanup always runs and pending writes are given a chance to settle before exiting.
♻️ Proposed fix
async function run() {
await configManager.getConfig(); // warm init so disk read is cached
- // 1) A burst of non-blocking saves must resolve effectively immediately —
- // they must not each wait on a disk write.
- const BURST = 100;
- const t0 = Date.now();
- await Promise.all(
- Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i))
- );
- const elapsed = Date.now() - t0;
- assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`);
- passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`);
-
- // 2) The in-memory value is visible immediately (synchronously updated).
- assert.strictEqual(await configManager.getValue(KEY), BURST - 1);
- passed++; console.log('✓ in-memory value reflects the latest write immediately');
-
- // 3) After the background flush window, config.json is valid JSON (no torn
- // write from overlapping saves) and holds the final coalesced value.
- await new Promise((r) => setTimeout(r, 300));
- let parsed;
- assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); },
- 'config.json must remain valid JSON after concurrent writes');
- assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk');
- passed++; console.log('✓ config.json is valid and holds the coalesced final value');
-
- // cleanup: remove the test key (undefined is dropped by JSON.stringify)
- await configManager.setValue(KEY, undefined);
+ const BURST = 100;
+ try {
+ // 1) A burst of non-blocking saves must resolve effectively immediately —
+ // they must not each wait on a disk write.
+ const t0 = Date.now();
+ await Promise.all(
+ Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i))
+ );
+ const elapsed = Date.now() - t0;
+ assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`);
+ passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`);
+
+ // 2) The in-memory value is visible immediately (synchronously updated).
+ assert.strictEqual(await configManager.getValue(KEY), BURST - 1);
+ passed++; console.log('✓ in-memory value reflects the latest write immediately');
+
+ // 3) After the background flush window, config.json is valid JSON (no torn
+ // write from overlapping saves) and holds the final coalesced value.
+ await new Promise((r) => setTimeout(r, 300));
+ let parsed;
+ assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); },
+ 'config.json must remain valid JSON after concurrent writes');
+ assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk');
+ passed++; console.log('✓ config.json is valid and holds the coalesced final value');
+ } finally {
+ // cleanup: always remove the test key, even on assertion failure, and
+ // await it so no torn write to the real config.json is left in flight.
+ await configManager.setValue(KEY, undefined).catch(() => {});
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async function run() { | |
| await configManager.getConfig(); // warm init so disk read is cached | |
| // 1) A burst of non-blocking saves must resolve effectively immediately — | |
| // they must not each wait on a disk write. | |
| const BURST = 100; | |
| const t0 = Date.now(); | |
| await Promise.all( | |
| Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i)) | |
| ); | |
| const elapsed = Date.now() - t0; | |
| assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`); | |
| passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`); | |
| // 2) The in-memory value is visible immediately (synchronously updated). | |
| assert.strictEqual(await configManager.getValue(KEY), BURST - 1); | |
| passed++; console.log('✓ in-memory value reflects the latest write immediately'); | |
| // 3) After the background flush window, config.json is valid JSON (no torn | |
| // write from overlapping saves) and holds the final coalesced value. | |
| await new Promise((r) => setTimeout(r, 300)); | |
| let parsed; | |
| assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); }, | |
| 'config.json must remain valid JSON after concurrent writes'); | |
| assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk'); | |
| passed++; console.log('✓ config.json is valid and holds the coalesced final value'); | |
| // cleanup: remove the test key (undefined is dropped by JSON.stringify) | |
| await configManager.setValue(KEY, undefined); | |
| } | |
| run() | |
| .then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); }) | |
| .catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); }); | |
| async function run() { | |
| await configManager.getConfig(); // warm init so disk read is cached | |
| const BURST = 100; | |
| try { | |
| // 1) A burst of non-blocking saves must resolve effectively immediately — | |
| // they must not each wait on a disk write. | |
| const t0 = Date.now(); | |
| await Promise.all( | |
| Array.from({ length: BURST }, (_, i) => configManager.setValueNonBlocking(KEY, i)) | |
| ); | |
| const elapsed = Date.now() - t0; | |
| assert.ok(elapsed < 200, `burst of ${BURST} non-blocking saves took ${elapsed}ms (expected < 200ms)`); | |
| passed++; console.log(`✓ ${BURST} non-blocking saves resolved in ${elapsed}ms`); | |
| // 2) The in-memory value is visible immediately (synchronously updated). | |
| assert.strictEqual(await configManager.getValue(KEY), BURST - 1); | |
| passed++; console.log('✓ in-memory value reflects the latest write immediately'); | |
| // 3) After the background flush window, config.json is valid JSON (no torn | |
| // write from overlapping saves) and holds the final coalesced value. | |
| await new Promise((r) => setTimeout(r, 300)); | |
| let parsed; | |
| assert.doesNotThrow(() => { parsed = JSON.parse(readFileSync(CONFIG_FILE, 'utf8')); }, | |
| 'config.json must remain valid JSON after concurrent writes'); | |
| assert.strictEqual(parsed[KEY], BURST - 1, 'final value must be persisted to disk'); | |
| passed++; console.log('✓ config.json is valid and holds the coalesced final value'); | |
| } finally { | |
| // cleanup: always remove the test key, even on assertion failure, and | |
| // await it so no torn write to the real config.json is left in flight. | |
| await configManager.setValue(KEY, undefined).catch(() => {}); | |
| } | |
| } | |
| run() | |
| .then(() => { console.log(`\nPASS (${passed}/3)`); process.exit(0); }) | |
| .catch((e) => { console.error(`\nFAIL: ${e.message}`); process.exit(1); }); |
🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 47-47: Avoid using the initial state variable in setState
Context: setTimeout(r, 300)
Note: [CWE-710] Improper Adherence to Coding Standards. Security best practice.
(setstate-same-var)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/test-nonblocking-config-save.js` around lines 28 - 61, The failure path
in run() can exit before pending non-blocking writes finish and before the test
key cleanup runs. Update the test around configManager.setValueNonBlocking,
scheduleSave, and the final cleanup so the body uses try/finally: always remove
__nonblockingSaveRegressionTest with configManager.setValue(KEY, undefined), and
allow any queued writeChain work to settle before the process exits from the
.catch() path.
Follow-up to the parallel-load hang fix. withTimeout() rejected on a timer but left the underlying fs op running, so a stalled read kept holding its libuv thread until the OS call returned. And the read timeout was a fixed 30s, which conflates "stalled" with "large file on slow media". Changes: - runWithAbortableTimeout(fn(signal), ms, name): passes an AbortSignal into the operation and aborts it on timeout, so signal-honoring reads are cancelled and their fd/thread released. Rejects with code ETIMEDOUT. - Read timeout is now size-aware: FLOOR (30s stall detector for small files) + size / assumed-throughput (>=4 MB/s) capped at 10 min. A 1GB file gets ~4.8 min instead of dying at 30s; a wedged tiny file still fails fast. - Thread AbortSignal through the read path: ReadOptions.signal -> TextFileHandler (createReadStream/readFile/fd loops) and ImageFileHandler. - readFileInternal (edit_block's whole-file read) had NO timeout at all; it now uses the same size-aware, cancellable read. - Raise the read_file handler backstop (60s -> 15min) so it sits above the size-aware inner cap and never cuts a large-file read short. Caveat (documented in code): a syscall wedged in-kernel on a cold cloud file only observes the abort once it returns; library readers (Excel/PDF) ignore the signal and still get the timeout but keep running in the background. The threadpool headroom from the previous commit covers those cases. Tests: - test/test-read-abort-timeout.js (CI): abort fires + ETIMEDOUT, size-aware timeout floor/scale/cap, normal read still works. 4/4. - test/repro/test-read-abort-frees-thread.js: proves the signal reaches fs and cancels a real read (rejects instead of returning the buffer).
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tools/filesystem.ts (1)
504-522: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winTimebox the pre-read
fs.statcalls — This repo targets Node >=18, andfsPromises.stat()still doesn’t acceptsignalthere, so both the directory check and size probe can block before the abortable read starts on slow mounts. Wrap each stat in a bounded timeout/race so callers don’t wait on the full mount latency.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/tools/filesystem.ts` around lines 504 - 522, The pre-read fs.stat probes in the filesystem read flow can still block on slow mounts before the abortable read begins. Update the stat checks in the validation and size-probe path of the file read logic in filesystem.ts to use a bounded timeout/race so they fail fast instead of waiting on the full mount latency. Keep the existing read behavior and telemetry, but ensure the stat calls in the relevant helper flow are time-limited.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/tools/filesystem.ts`:
- Around line 504-522: The pre-read fs.stat probes in the filesystem read flow
can still block on slow mounts before the abortable read begins. Update the stat
checks in the validation and size-probe path of the file read logic in
filesystem.ts to use a bounded timeout/race so they fail fast instead of waiting
on the full mount latency. Keep the existing read behavior and telemetry, but
ensure the stat calls in the relevant helper flow are time-limited.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: db27ffff-72d2-4f6e-8374-8f34c05f17f6
📒 Files selected for processing (8)
src/handlers/filesystem-handlers.tssrc/tools/filesystem.tssrc/utils/files/base.tssrc/utils/files/image.tssrc/utils/files/text.tssrc/utils/withTimeout.tstest/repro/test-read-abort-frees-thread.jstest/test-read-abort-timeout.js
Per review: replace the size-aware read timeout with a flat 3 minutes. The MCP client hard-caps a tool call at ~4 minutes, so timing out at 3m lets us abort the fs op (via runWithAbortableTimeout) and return the useful cloud-storage error before the client reports an opaque timeout. The AbortController cancellation is unchanged; only the duration/shape of the budget changed. readFileFromDisk and readFileInternal both use it; the read_file handler backstop is 3m30s (above the inner read, below the client cap). computeReadTimeoutMs removed.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/tools/filesystem.ts (2)
542-559: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueStale dead-code branch:
isWithTimeoutStringno longer applies to this call site.This catch block still checks for the old
withTimeoutplain-string rejection format, butreadOperationis now run viarunWithAbortableTimeout(Lines 545-549), which per its contract always rejects with a properErrorcarrying.code = 'ETIMEDOUT'— never a string.isWithTimeoutStringwill always befalsehere now, so the comment (Lines 552-554) describing the string-rejection case is misleading for this specific call site and the branch is effectively dead.♻️ Suggested cleanup
} catch (error) { const err = error as NodeJS.ErrnoException; - // withTimeout rejects with a plain string "__ERROR__: ... timed out after N seconds" - // when defaultValue is null — it has no .code property, so check for that too. - const isWithTimeoutString = typeof error === 'string' && (error as string).startsWith('__ERROR__:'); - if (isWithTimeoutString || err.code === 'EPERM' || err.code === 'EACCES' || err.code === 'ETIMEDOUT') { - throw buildPermissionError(filePath, isWithTimeoutString ? 'ETIMEDOUT' : err.code); + if (err.code === 'EPERM' || err.code === 'EACCES' || err.code === 'ETIMEDOUT') { + throw buildPermissionError(filePath, err.code); } throw error; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/tools/filesystem.ts` around lines 542 - 559, In the read path handled by `runWithAbortableTimeout` inside `readOperation`, remove the stale `isWithTimeoutString` handling from the catch block because this helper now always rejects with an `Error` and `.code = 'ETIMEDOUT'`. Update the `buildPermissionError` call and surrounding comments in `filesystem.ts` to reflect the current contract, and keep only the error-code checks that can actually occur here.
23-29: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winTighten the timeout comments and remove the stale string fallback
src/tools/filesystem.ts:23-29, 506-508, 614-621should avoid saying thefd/threadis freed immediately;runWithAbortableTimeoutfails the call fast, but a blocked syscall can still keep running underneath.src/tools/filesystem.ts:552-556still has leftoverwithTimeoutstring handling here; this path now raisesETIMEDOUTerrors.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/tools/filesystem.ts` around lines 23 - 29, Update the timeout documentation around READ_OPERATION_TIMEOUT_MS and the related filesystem read path so it no longer claims the fd/thread is released immediately; describe it as failing fast via runWithAbortableTimeout while the underlying syscall may still continue briefly. Also remove the stale withTimeout string fallback in the read/error handling path that now raises ETIMEDOUT, and update the affected logic in the filesystem read helper to only handle the current error shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/tools/filesystem.ts`:
- Around line 542-559: In the read path handled by `runWithAbortableTimeout`
inside `readOperation`, remove the stale `isWithTimeoutString` handling from the
catch block because this helper now always rejects with an `Error` and `.code =
'ETIMEDOUT'`. Update the `buildPermissionError` call and surrounding comments in
`filesystem.ts` to reflect the current contract, and keep only the error-code
checks that can actually occur here.
- Around line 23-29: Update the timeout documentation around
READ_OPERATION_TIMEOUT_MS and the related filesystem read path so it no longer
claims the fd/thread is released immediately; describe it as failing fast via
runWithAbortableTimeout while the underlying syscall may still continue briefly.
Also remove the stale withTimeout string fallback in the read/error handling
path that now raises ETIMEDOUT, and update the affected logic in the filesystem
read helper to only handle the current error shape.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cb13c4e9-b9fa-4f1d-83b2-acf3fa66c6ff
📒 Files selected for processing (3)
src/handlers/filesystem-handlers.tssrc/tools/filesystem.tstest/test-read-abort-timeout.js
readFileFromDisk now times out via runWithAbortableTimeout, which rejects with a coded Error (.code === 'ETIMEDOUT') rather than the legacy '__ERROR__:' string that withTimeout(..., null) produced. The string check in the catch was therefore unreachable; the existing ETIMEDOUT branch already handles it.
The integration case read its own path via readFile, which enforces allowedDirectories from the on-disk config. When another suite test had left allowedDirectories pointing elsewhere (e.g. test/test_output), the self-read was rejected and the test failed depending on ambient config. It now creates its own temp dir, sets allowedDirectories to it (realpath'd to avoid the macOS /tmp symlink mismatch), reads a temp file, and restores the original config in a finally. Also dropped the now-unused __dirname/ fileURLToPath.
Problem
Under heavy parallel file I/O — e.g. several
claude -pagents reading/writing files on a slow or cloud-synced filesystem — tool calls could hang for ~4 minutes and then work again on retry. Crucially, even pure in-memory tools likelist_processeshung, which pointed past any single tool to a shared bottleneck.Reported by a user running MCP in Claude Code with ~4–6 parallel subprocesses; reproduced locally.
Root cause
Three compounding issues:
await usageTracker.trackSuccess(name)before returning, which chains throughconfigManager.setValue→fs.writeFile.fsruns on libuv's threadpool (default 4 threads). When parallel reads stalled on a slow filesystem occupied all 4 threads — and stayed occupied — every tool's response was gated, including pure-memory ones.withTimeout) did not cancel the underlying op, so a stalled read held its thread for the full OS duration, and read timeouts undercut / conflicted with the client's own cap.Each call also fired an independent
fs.writeFileof the same config file — a latent corruption risk under concurrency.Fixes
configManager.setValueNonBlocking()updates memory synchronously and schedules a background write on one serialized chain; a burst collapses to at most one queued write.usageTracker.saveStatsuses it, so the response never waits on disk. Serialization also removes the concurrent-write corruption risk.UV_THREADPOOL_SIZE(default16, respects a user-set value) viasrc/bootstrap.ts, imported first inindex.ts— before any fs work is submitted, which is required for libuv to pick it up.runWithAbortableTimeout()passes anAbortSignalinto the read and aborts it on timeout, so a stalled read is cancelled and its fd/thread released instead of leaked. The timeout is a flat 3 minutes, chosen to sit below the MCP client's ~4-minute hard cap: we abort the fs op and return the useful cloud-storage error (buildPermissionError) before the client gives up with an opaque "No result received after 4 minutes". Signal is threaded throughReadOptions→ text/image handlers;readFileInternal(theedit_blockread, previously untimed) now uses it too. The read_file handler backstop is 3m30s — above the inner read, still below the client cap. Caveat: an in-kernel-wedged syscall only observes the abort once it returns, and library readers (Excel/PDF) ignore the signal — the threadpool headroom covers those.Verification
Reproductions in
test/repro/(deterministic, kept out of CI):test-threadpool-starvation.js(pool=4, 4 stalled reads)test-dc-tracking-gate.js— realtrackSuccess()under starved pooltest-bootstrap-threadpool.js— shipped bootstrap, no env overridetest-withtimeout-leak.js/test-read-abort-frees-thread.jsCI regression tests (fast, cross-platform):
test/test-nonblocking-config-save.js— non-blocking + coalesced + validconfig.jsonunder 100 concurrent writes. PASS (3/3).test/test-read-abort-timeout.js— abort fires + ETIMEDOUT, timeout is 3m and below the client cap, normal read still works. PASS (4/4).ab-test.test.js(17/0),test-file-handlers.js,test-line-count.js,test-edit-block-line-endings.jsall pass; server module loads cleanly.Note:
test-edit-block-occurrences.jsTest 3 fails, but it fails identically onmain— a pre-existing issue unrelated to this PR.Follow-ups (not in this PR)
interact_with_processre-join()s the whole (≤50MB) output buffer every 50ms poll (terminal-manager.ts), burning ~20% of the event loop for a single active session and scaling with concurrency; track a byte offset instead. Characterized intest/repro/test-interact-join-stall.js.toolHistoryflush uses syncappendFileSync/statSyncon a 1s interval — a small periodic event-loop hitch, not on the response path.Summary by CodeRabbit
New Features
Bug Fixes