Commit 65c4a4b

edgarsskore and claude authored
fix: recover device Realtime channel from half-open socket on reconnect (#520)

* fix: recover device Realtime channel from half-open socket on reconnect

After idle / wifi-loss / sleep the device's Supabase Realtime socket can go
half-open (conn.readyState stays OPEN but the peer is gone). recreateChannel()
removed the old channel un-awaited and synchronously pushed a new one, so the
channel registry never reached 0, realtime-js never tore the dead socket down,
and every re-subscribe TIMED_OUT forever -- only a process restart recovered.

Fix (remote-channel.ts):
- recreateChannel(): add a re-entrancy guard, await removeChannel(), and force
  a fresh WebSocket via realtime.disconnect() before re-subscribing.
- checkConnectionHealth(): treat 'joining' as healthy so realtime-js's own
  rejoin backoff can converge instead of being torn down mid-join.

Also enrich the existing reconnect/timeout logs with a compact connState() line
(socket state + readyState + channel state + attempt), turn on the previously
commented-out CHANNEL_ERROR log, and add a CLOSED branch.

Add test/remote-channel-reconnect.test.ts -- a deterministic repro that fails
on the old behavior (8x TIMED_OUT, dead socket reused) and passes with the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test+fix: address PR #520 review (timeout-guard recreate, settle CLOSED, suite-native test)

- recreateChannel(): wrap the awaited section in a 30s timeout (Promise.race,
  mirrors closeWithTimeout) so a never-settling await can't pin
  isRecreatingChannel=true and silently disable the 10s watchdog.
- createChannel() CLOSED branch: reject() (was un-settled -> could hang the
  recreate) and setOnlineStatus('offline') for parity with
  CHANNEL_ERROR/TIMED_OUT.
- Rewrite the repro test as test/test-remote-channel-reconnect.js (plain JS,
  imports compiled dist, runs under node) so run-all-tests.js discovers it and
  it runs in `npm test` -- no runner/package.json special-casing. Removes the
  tsx-only .ts version. Adds a 'joining is treated as healthy' case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent be82290 commit 65c4a4b
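
The recreate sequence described in the commit message (re-entrancy guard, awaited teardown, forced disconnect, timeout cap) can be sketched in isolation. This is a minimal illustration, not the code from the commit: `FakeRealtime`, `Recreator`, and the `"skipped" | "ok" | "failed"` return values are hypothetical names introduced here, and the fake client stands in for the Supabase client rather than reproducing its API.

```typescript
// Hypothetical stand-in for the realtime client (assumption, not the Supabase API).
interface FakeRealtime {
  removeChannel(): Promise<void>;
  disconnect(): Promise<void>;
  subscribe(): Promise<void>;
}

// Reject if `op` has not settled within `ms`; the timer is always cleared.
async function withTimeout<T>(op: () => Promise<T>, ms: number, name: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    return await Promise.race([
      op(),
      new Promise<T>((_, reject) => {
        timer = setTimeout(() => reject(new Error(`${name} timed out after ${ms}ms`)), ms);
      }),
    ]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

class Recreator {
  inFlight = false; // re-entrancy guard
  attempts = 0;     // attempts since construction
  private rt: FakeRealtime;
  private timeoutMs: number;

  constructor(rt: FakeRealtime, timeoutMs: number) {
    this.rt = rt;
    this.timeoutMs = timeoutMs;
  }

  // The return value exists only so callers can observe the outcome in this sketch.
  async recreate(): Promise<"skipped" | "ok" | "failed"> {
    if (this.inFlight) return "skipped"; // a recreate is already running
    this.inFlight = true;
    this.attempts++;
    try {
      await withTimeout(async () => {
        await this.rt.removeChannel(); // wait until the old channel is gone
        await this.rt.disconnect();    // drop the possibly half-open socket
        await this.rt.subscribe();     // join on a fresh connection
      }, this.timeoutMs, "recreate");
      return "ok";
    } catch {
      return "failed";
    } finally {
      this.inFlight = false; // cleared even when the timeout fires
    }
  }
}
```

The point of the `finally` block is that a subscribe which never settles still releases the guard once the timeout rejects, so a later health tick can try again instead of being skipped indefinitely.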

2 files changed

Lines changed: 376 additions & 29 deletions

src/remote-device/remote-channel.ts

Lines changed: 103 additions & 29 deletions
@@ -17,6 +17,9 @@ interface DeviceData {
 }
 
 const HEARTBEAT_INTERVAL = 15000;
+// Cap a single channel recreate so a hung await can't pin the re-entrancy guard
+// true (which would silently disable the connection watchdog).
+const RECREATE_TIMEOUT_MS = 30000;
 
 export class RemoteChannel {
     private client: SupabaseClient | null = null;
@@ -35,6 +38,10 @@ export class RemoteChannel {
     // Track last channel state for debug logging
     private lastChannelState: string | null = null;
 
+    // Reconnect diagnostics + guard (see connState() / recreateChannel())
+    private reconnectAttempt = 0; // recreateChannel() attempts since last success
+    private isRecreatingChannel = false; // a recreate is in flight (re-entrancy guard)
+
     private _user: User | null = null;
     get user(): User | null { return this._user; }
 
@@ -166,7 +173,7 @@ export class RemoteChannel {
 
             // ! Ignore silently in Initialization to reconnect after
             await this.createChannel().catch((error) => {
-                console.debug('[DEBUG] Failed to create channel, will retry after socket reconnect', error);
+                console.debug(`[DEBUG] Failed to create channel, will retry after socket reconnect: ${error?.message || error}${this.connState()}`);
             });
 
         } else {
@@ -206,10 +213,12 @@ export class RemoteChannel {
                 )
                 .subscribe((status: string, err: any) => {
                     // Debug: Log all subscription status events
-                    console.debug(`[DEBUG] Channel subscription status: ${status}${err ? ' (error: ' + err + ')' : ''}`);
+                    console.debug(`[DEBUG] Channel subscription status: ${status}${err ? ' (error: ' + (err?.message || err) + ')' : ''}${this.connState()}`);
 
                     if (status === 'SUBSCRIBED') {
-                        console.log('✅ Channel subscribed');
+                        const recovered = this.reconnectAttempt;
+                        this.reconnectAttempt = 0;
+                        console.log(`✅ Channel subscribed${recovered > 0 ? ` (recovered after ${recovered} attempt${recovered === 1 ? '' : 's'})` : ''}`);
                         // Update device status on successful connection
                         if (this.deviceId) {
                             this.setOnlineStatus(this.deviceId, 'online').catch(e => {
@@ -218,20 +227,42 @@ export class RemoteChannel {
                         }
                         resolve();
                     } else if (status === 'CHANNEL_ERROR') {
-                        // console.error('❌ Channel subscription failed:', err);
+                        // CHANNEL_ERROR is the only status carrying a real error message.
+                        console.error(`❌ Channel error: ${err?.message || 'unknown'}${this.connState()}`);
                         this.setOnlineStatus(this.deviceId!, 'offline');
-                        captureRemote('remote_channel_subscription_error', { error: err || 'Channel error' }).catch(() => { });
+                        captureRemote('remote_channel_subscription_error', { error: err?.message || 'Channel error' }).catch(() => { });
                         reject(err || new Error('Failed to initialize tool call channel subscription'));
                     } else if (status === 'TIMED_OUT') {
-                        console.error('⏱️ Channel subscription timed out, Reconnecting...');
+                        console.error(`⏱️ Channel subscription timed out, Reconnecting...${this.connState()}`);
                         this.setOnlineStatus(this.deviceId!, 'offline');
-                        captureRemote('remote_channel_subscription_timeout', {}).catch(() => { });
+                        captureRemote('remote_channel_subscription_timeout', { attempt: this.reconnectAttempt }).catch(() => { });
                         reject(new Error('Tool call channel subscription timed out'));
+                    } else if (status === 'CLOSED') {
+                        // Settle the promise so an in-flight recreateChannel() can't await
+                        // forever (which would wedge the re-entrancy guard / watchdog), and
+                        // mark the device offline like the other degraded states.
+                        console.warn(`⚠️ Channel closed — ${this.connState()}`);
+                        this.setOnlineStatus(this.deviceId!, 'offline');
+                        reject(new Error('Tool call channel closed during subscribe'));
                     }
                 });
         });
     }
 
+    /**
+     * Compact connection state for logs — e.g. "socket=open(1) ch=errored attempt=3".
+     * readyState 1=OPEN (a 1 while joins keep failing = a half-open socket being reused),
+     * 3=CLOSED, '-'=no socket. Reads realtime-js internals defensively; never throws.
+     */
+    private connState(): string {
+        let socket = '?';
+        try {
+            const rt: any = (this.client as any)?.realtime;
+            socket = `${rt?.connectionState?.() ?? '?'}(${rt?.conn?.readyState ?? '-'})`;
+        } catch { /* best effort */ }
+        return `socket=${socket} ch=${this.channel?.state ?? '-'} attempt=${this.reconnectAttempt}`;
+    }
+
     /**
      * Check if channel is connected, recreate if not.
      */
@@ -244,47 +275,90 @@ export class RemoteChannel {
 
         // Debug: Log current channel state (only if changed)
         if (!this.lastChannelState || this.lastChannelState !== state) {
-            console.debug(`[DEBUG] channel state: ${state}`);
+            console.debug(`[DEBUG] channel state: ${state}${this.connState()}`);
             this.lastChannelState = state;
         }
 
-        // Aggressive health check: Only 'joined' is considered healthy
-        // Any other state (joining, leaving, closed, errored, etc.) triggers recreation
-        if (state !== 'joined') {
-            captureRemote('remote_channel_state_health', { state });
+        // 'joined' = healthy, 'joining' = transitional — let realtime-js's own rejoin
+        // backoff converge instead of tearing the channel down mid-join. (FIX: previously
+        // recreated on every non-joined state, which amputated that backoff.)
+        if (state === 'joined' || state === 'joining') return;
 
-            console.debug(`[DEBUG] ⚠️ Channel in unhealthy state '${state}' - recreating...`);
-            this.recreateChannel();
+        // Unhealthy: closed, errored, leaving — recreate
+        captureRemote('remote_channel_state_health', { state, attempt: this.reconnectAttempt });
+        console.debug(`[DEBUG] ⚠️ Channel in unhealthy state '${state}' - recreating... — ${this.connState()}`);
+        this.recreateChannel();
+    }
+
+    /**
+     * Run an async op but reject if it doesn't settle within `ms`, so a hung await
+     * can't leave isRecreatingChannel stuck true and disable the watchdog. Mirrors
+     * closeWithTimeout() in desktop-commander-integration.ts.
+     */
+    private async withTimeout<T>(op: () => Promise<T>, ms: number, name: string): Promise<T> {
+        let timer: NodeJS.Timeout | undefined;
+        try {
+            return await Promise.race([
+                op(),
+                new Promise<T>((_, reject) => {
+                    timer = setTimeout(() => reject(new Error(`${name} timed out after ${ms}ms`)), ms);
+                }),
+            ]);
+        } finally {
+            if (timer) clearTimeout(timer);
         }
     }
 
     /**
      * Recreate the channel by destroying old one and creating fresh instance.
      */
-    private recreateChannel(): void {
+    private async recreateChannel(): Promise<void> {
         if (!this.client || !this.user?.id || !this.onToolCall) {
             console.warn('Cannot recreate channel - missing parameters');
             console.debug('[DEBUG] recreateChannel() aborted - missing prerequisites');
             return;
         }
 
-        // Destroy old channel
-        if (this.channel) {
-            console.debug('[DEBUG] Destroying old channel');
-            this.client.removeChannel(this.channel);
-            this.channel = null;
+        // FIX: re-entrancy guard so a 10s health tick can't stack a second recreate
+        // on top of an in-flight one.
+        if (this.isRecreatingChannel) {
+            console.debug('[DEBUG] recreateChannel() skipped - already in progress');
+            return;
         }
+        this.isRecreatingChannel = true;
+        this.reconnectAttempt++;
 
         // Create fresh channel
-        console.log('🔄 Recreating channel...');
-        console.debug('[DEBUG] Calling createChannel() for recreation');
-        this.createChannel().catch(err => {
-            captureRemote('remote_channel_recreate_error', { err });
-            console.debug('[DEBUG] Channel recreation failed:', err.message);
-
-            // TODO: enable only for debug mode
-            // console.error('Failed to recreate channel:', err);
-        });
+        console.log(`🔄 Recreating channel... (attempt ${this.reconnectAttempt}) — ${this.connState()}`);
+
+        try {
+            // Cap the whole recreate: a never-settling await (e.g. a subscribe that only
+            // ever emits CLOSED) must not pin isRecreatingChannel=true and silently disable
+            // the 10s watchdog. On timeout we reject -> catch -> finally clears the guard.
+            await this.withTimeout(async () => {
+                // Destroy old channel — AWAIT it so the channel registry empties before we
+                // rebuild. (The un-awaited version raced the synchronous new-channel push, so
+                // realtime-js never tore the socket down and a half-open one got reused.)
+                if (this.channel) {
+                    console.debug('[DEBUG] Destroying old channel');
+                    await this.client!.removeChannel(this.channel);
+                    this.channel = null;
+                }
+
+                // FIX (core): force a brand-new WebSocket. After idle / wifi-loss the socket can
+                // be HALF-OPEN (readyState OPEN but dead); reusing it made every join TIME_OUT
+                // forever. disconnect() drops it so the next subscribe() dials a fresh one.
+                try { await (this.client as any).realtime?.disconnect?.(); } catch { /* best effort */ }
+
+                console.debug('[DEBUG] Calling createChannel() for recreation');
+                await this.createChannel();
+            }, RECREATE_TIMEOUT_MS, 'recreateChannel');
+        } catch (err: any) {
+            captureRemote('remote_channel_recreate_error', { errMsg: err?.message, attempt: this.reconnectAttempt });
+            console.debug(`[DEBUG] Channel recreation failed: ${err?.message}${this.connState()}`);
+        } finally {
+            this.isRecreatingChannel = false;
+        }
     }
 
     async markCallExecuting(callId: string) {
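
The change to the health-check policy and the shape of the diagnostic line can be shown as two small pure functions. This is a sketch under stated assumptions: `ChannelState`, `shouldRecreateOld`, `shouldRecreateNew`, and `formatConnState` are hypothetical names for illustration, and the state list is limited to the states named in the diff's comments.

```typescript
// Channel states named in the diff's comments (assumed list, plain strings here).
type ChannelState = "joined" | "joining" | "leaving" | "closed" | "errored";

// Old policy: anything other than 'joined' triggered a recreate.
function shouldRecreateOld(state: ChannelState): boolean {
  return state !== "joined";
}

// New policy: 'joining' is transitional, so the library's own rejoin is left alone.
function shouldRecreateNew(state: ChannelState): boolean {
  return state !== "joined" && state !== "joining";
}

// Builds a line in the same shape as the connState() doc comment example.
// Missing values fall back to '?' (socket state) or '-' (readyState, channel state).
function formatConnState(
  socketState: string | undefined,
  readyState: number | undefined,
  channelState: string | undefined,
  attempt: number,
): string {
  return `socket=${socketState ?? "?"}(${readyState ?? "-"}) ch=${channelState ?? "-"} attempt=${attempt}`;
}

console.log(shouldRecreateOld("joining")); // true
console.log(shouldRecreateNew("joining")); // false
console.log(formatConnState("open", 1, "errored", 3)); // socket=open(1) ch=errored attempt=3
```

A `socket=open(1)` reading next to a rising `attempt` count is the signature the doc comment describes for a half-open socket being reused.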
