
fix: invert TTL refresh guard in DefaultDictWithTimeout (memory leak) #2255

Closed
karesansui-u wants to merge 6 commits into The-PR-Agent:main from karesansui-u:fix/ttl-refresh-guard-inversion

@karesansui-u

Contributor

Bug description

DefaultDictWithTimeout.__refresh() has an inverted guard condition. The comparison > should be <:

# Current (bug): returns when refresh IS needed
if request_time - self.__last_refresh > self.__refresh_interval:
    return

# Fixed: returns when refresh is NOT needed
if request_time - self.__last_refresh < self.__refresh_interval:
    return

As a result, expired keys are never cleaned up. The to_delete loop runs only when request_time - __last_refresh is small (little time has passed, so there is nothing to delete) and is skipped once enough time has passed (exactly when cleanup is needed).
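
To make the intended behaviour concrete, here is a minimal sketch of a TTL-backed defaultdict with the corrected guard. `TtlDefaultDict`, its attribute names, and the injectable `clock` parameter are hypothetical and chosen for illustration; this is not the actual `DefaultDictWithTimeout` implementation.

```python
import time
from collections import defaultdict


class TtlDefaultDict(defaultdict):
    """Hypothetical, simplified stand-in for a defaultdict with TTL eviction."""

    def __init__(self, default_factory=None, ttl=None, refresh_interval=60,
                 clock=time.monotonic):
        super().__init__(default_factory)
        self._ttl = ttl
        self._refresh_interval = refresh_interval
        self._clock = clock          # injectable so tests need not sleep
        self._key_times = {}         # key -> time of last access
        self._last_refresh = clock()

    def _refresh(self):
        if self._ttl is None:
            return
        now = self._clock()
        # Corrected guard: skip the sweep while the interval has NOT elapsed.
        if now - self._last_refresh < self._refresh_interval:
            return
        expired = [k for k, t in self._key_times.items() if now - t > self._ttl]
        for key in expired:
            del self[key]
        self._last_refresh = now

    def __getitem__(self, key):
        self._refresh()
        value = super().__getitem__(key)
        self._key_times[key] = self._clock()
        return value

    def __setitem__(self, key, value):
        self._key_times[key] = self._clock()
        super().__setitem__(key, value)

    def __delitem__(self, key):
        self._key_times.pop(key, None)
        super().__delitem__(key)
```

With the comparison flipped back to `>`, the sweep in `_refresh` would be skipped whenever the interval has elapsed, which is the behaviour described above.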

Impact

_duplicate_push_triggers and _pending_task_duplicate_push_conditions in the GitHub App server grow without bound. Every processed PR URL stays in memory forever. On long-running server instances, this is a memory leak.

Fix

Change > to < on line 67 of pr_agent/servers/utils.py.

Affected files

  • pr_agent/servers/utils.py (L67) — 1 character change

@qodo-free-for-open-source-projects

Contributor

Review Summary by Qodo

Fix inverted TTL refresh guard causing memory leak

🐞 Bug fix


Walkthroughs

Description
• Inverts TTL refresh guard condition from > to <
• Fixes memory leak in DefaultDictWithTimeout expiration logic
• Prevents unbounded growth of duplicate push trigger caches
• Ensures expired keys are properly cleaned up periodically
Diagram
flowchart LR
  A["Guard Condition<br/>request_time - last_refresh"] --> B{"Comparison<br/>Operator"}
  B -->|Before: >| C["Returns early<br/>when refresh needed"]
  B -->|After: <| D["Returns early<br/>when refresh not needed"]
  C --> E["Expired keys<br/>never deleted"]
  D --> F["Expired keys<br/>properly cleaned"]
  E --> G["Memory Leak"]
  F --> H["Fixed"]

File Changes

1. pr_agent/servers/utils.py 🐞 Bug fix +1/-1

Invert TTL refresh guard comparison operator

• Changed comparison operator from > to < in __refresh() method guard condition
• Fixes logic so deletion loop runs when enough time has passed
• Prevents memory leak in _duplicate_push_triggers and _pending_task_duplicate_push_conditions caches

pr_agent/servers/utils.py



@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Mar 16, 2026

Contributor

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (1) 📜 Skill insights (0)



Action required

1. Non-atomic dedupe cleanup 🐞 Bug ☼ Reliability
Description
handle_push_trigger_for_new_commits() decides to clean up shared per-PR state while holding the
Condition lock, but performs the actual .pop() outside that lock, allowing a concurrent push-trigger
task to start in between. This can delete the condition/counter for an in-flight task and corrupt
the dedupe bookkeeping (missed wait/notify, inconsistent counters).
Code

pr_agent/servers/github_app.py[R203-215]

+        # release the waiting task block, then clean up if no tasks remain
+        should_cleanup = False
+        try:
+            async with _pending_task_duplicate_push_conditions[api_url]:
+                _pending_task_duplicate_push_conditions[api_url].notify(1)
+                _duplicate_push_triggers[api_url] -= 1
+                should_cleanup = _duplicate_push_triggers[api_url] <= 0
+        except KeyError:
+            # TTL eviction already cleaned up this entry
+            pass
+        if should_cleanup:
+            _duplicate_push_triggers.pop(api_url, None)
+            _pending_task_duplicate_push_conditions.pop(api_url, None)
Evidence
The admission path increments _duplicate_push_triggers[api_url] before acquiring the condition
lock, while the new cleanup code removes the dict entries after releasing the condition
lock—creating a check-then-act race window where a concurrent task can start and then have its state
popped out from under it.

pr_agent/servers/github_app.py[175-195]
pr_agent/servers/github_app.py[202-215]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`handle_push_trigger_for_new_commits()` computes `should_cleanup` under the per-PR `asyncio.Condition` lock, but then removes `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` after releasing that lock. Because the increment/admission path mutates `_duplicate_push_triggers` before acquiring the condition lock, a new task can interleave between the decrement/check and the pop, causing live state to be removed.
### Issue Context
This code is implementing a concurrency/deduplication gate for push triggers per `api_url`. The counter and the condition must have a consistent lifecycle across all concurrent tasks.
### Fix Focus Areas
- pr_agent/servers/github_app.py[175-215]
### Suggested fix direction
- Ensure **all** mutations and lifecycle transitions for both `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` are performed under a single, consistent synchronization mechanism.
- One workable approach: fetch/create the per-PR condition first, then `async with condition:` to protect **both** admission (`setdefault`/increment) and cleanup (decrement + possible pop) for that `api_url`.
- Alternatively, introduce a dedicated lock (global or per-`api_url`) that guards: setdefault/increment, decrement/check, and pop, so no new task can start between the cleanup decision and the removal.
- If you keep the `pop()`, perform it inside the same critical section as the `should_cleanup` decision, and re-check the counter immediately before removing entries.
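
One way to read the suggestions above is sketched below: a single lock guards admission, decrement, and removal, so the cleanup decision and the removal happen in the same critical section. `DedupeGate` and its methods are hypothetical names for illustration. The sketch covers only the counter, not the per-PR `asyncio.Condition`.

```python
import asyncio


class DedupeGate:
    """Hypothetical per-key admission gate; all names here are illustrative."""

    def __init__(self, max_active):
        self._lock = asyncio.Lock()   # one lock guards every counter transition
        self._max_active = max_active
        self.active = {}              # key -> number of admitted tasks

    async def try_admit(self, key):
        """Admit the caller unless max_active tasks already hold this key."""
        async with self._lock:
            current = self.active.get(key, 0)
            if current >= self._max_active:
                return False
            self.active[key] = current + 1
            return True

    async def release(self, key):
        """Decrement and, if no task is left, drop the entry in the same critical section."""
        async with self._lock:
            remaining = self.active.get(key, 0) - 1
            if remaining <= 0:
                self.active.pop(key, None)
            else:
                self.active[key] = remaining
```

In single-threaded asyncio code the lock only becomes essential once an `await` sits between the check and the update. It is kept here so the invariant survives later edits that introduce one.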



2. pop bypasses TTL bookkeeping 🐞 Bug ☼ Reliability
Description
handle_push_trigger_for_new_commits() removes DefaultDictWithTimeout entries via .pop(), which
does not update DefaultDictWithTimeout’s internal __key_times map. After this PR makes
__refresh() run, a later refresh can try to del self[key] for a key that was already popped from
the dict, raising KeyError and potentially failing webhook handling.
Code

pr_agent/servers/github_app.py[R214-215]

Evidence
The cleanup uses .pop() on DefaultDictWithTimeout, but the class only synchronizes its internal
timestamp map on __delitem__. __refresh() deletes based on __key_times using del self[key],
which will raise if the dict entry was already removed via .pop() and __key_times wasn’t
updated.

pr_agent/servers/github_app.py[203-215]
pr_agent/servers/utils.py[63-72]
pr_agent/servers/utils.py[84-86]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`DefaultDictWithTimeout` maintains a private `__key_times` map that is only kept in sync when deletion goes through `__delitem__`. The new cleanup uses `.pop()`, which bypasses `__delitem__`, leaving stale timestamps behind.
Once `__refresh()` runs (and this PR makes it run at the correct cadence), it will build `to_delete` from `__key_times` and execute `del self[key]`. If the dict entry was previously removed via `.pop()`, `super().__delitem__(key)` raises `KeyError`, propagating out of `__refresh()`/`__getitem__()`.
## Issue Context
This affects both `_duplicate_push_triggers` and `_pending_task_duplicate_push_conditions`, which are `DefaultDictWithTimeout` instances.
## Fix Focus Areas
- pr_agent/servers/github_app.py[203-215]
- pr_agent/servers/utils.py[63-72]
- pr_agent/servers/utils.py[84-86]
## Suggested fix
Prefer one of:
1) **Don’t use `.pop()` on `DefaultDictWithTimeout`**; delete via `del` so `__delitem__` runs, e.g.:
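
As an illustration of that difference, the sketch below uses a hypothetical `TrackedDict` (not the real `DefaultDictWithTimeout`) whose `key_times` side table stands in for `__key_times`. It relies only on standard CPython behaviour: the built-in `dict.pop` does not dispatch to an overridden `__delitem__`.

```python
class TrackedDict(dict):
    """Hypothetical dict that mirrors its keys in a side table, as a TTL map would."""

    def __init__(self):
        super().__init__()
        self.key_times = {}

    def __setitem__(self, key, value):
        self.key_times[key] = 0.0  # placeholder timestamp
        super().__setitem__(key, value)

    def __delitem__(self, key):
        del self.key_times[key]
        super().__delitem__(key)


d = TrackedDict()
d["a"] = 1
d["b"] = 2

d.pop("a", None)  # built-in dict.pop bypasses the overridden __delitem__
del d["b"]        # goes through __delitem__, so the side table is updated

stale_keys = sorted(d.key_times)  # only "a" is left behind in the side table
```

A later sweep that iterates `key_times` and runs `del d[key]` would hit the stale `"a"` entry and raise `KeyError`, which is the failure mode described above.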



3. TTL evicts in-flight state 🐞 Bug ≡ Correctness
Description
Now that DefaultDictWithTimeout.__refresh() actually runs, it can delete the asyncio.Condition and
counter entries used by handle_push_trigger_for_new_commits while the first task is still
running/waiting, if processing time exceeds the configured TTL. That can cause a waiting task to
never be notified (notify happens on a newly-created Condition) and/or raise KeyError when
decrementing _duplicate_push_triggers in the finally block.
Code

pr_agent/servers/utils.py[R67-68]

Evidence
DefaultDictWithTimeout will now perform periodic TTL-based deletion (every refresh_interval) of any
key whose last recorded access time is older than ttl. The GitHub App uses this structure to store
per-PR-url synchronization primitives and a running-task counter, but it does not touch these dict
keys during the long await of PR processing, so another webhook can trigger a refresh and evict the
key while it is still needed; default TTL is 300 seconds and the e2e test explicitly allows ~5
minutes for processing, making this eviction plausible in normal operation.

pr_agent/servers/utils.py[63-72]
pr_agent/servers/github_app.py[175-206]
pr_agent/settings/configuration.toml[234-255]
tests/e2e_tests/test_github_app.py[23-29]
pr_agent/tools/pr_reviewer.py[152-156]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`DefaultDictWithTimeout` eviction is now active and can delete `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` while `handle_push_trigger_for_new_commits` is still processing. This can break synchronization: a waiter may block forever if `notify()` runs on a newly-created `Condition`, and `_duplicate_push_triggers[api_url] -= 1` can raise if the key was evicted.
## Issue Context
- The dicts are accessed, then the code awaits long-running PR processing (`agent.handle_request` via `_perform_auto_commands_github`) without touching those keys.
- TTL default is 300s, and the e2e GitHub App test allows ~5 minutes processing.
## Fix Focus Areas
- pr_agent/servers/github_app.py[175-206]
- pr_agent/servers/utils.py[63-78]
- pr_agent/settings/configuration.toml[234-252]
## Implementation notes
- Prefer explicit lifecycle cleanup: in the `finally` block, after decrementing, if the counter becomes `0`, delete both dict entries for `api_url` (guard with `try/except KeyError`).
- If you keep TTL, ensure it cannot evict in-flight state: either set TTL to a safe upper bound relative to worst-case processing time or periodically refresh/touch the key during processing.
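
The "newly-created `Condition`" hazard mentioned above can be reproduced with a plain `defaultdict`. This is a small sketch in which `del` simulates a TTL eviction, and the key name is hypothetical.

```python
import asyncio
from collections import defaultdict


async def demo():
    conditions = defaultdict(asyncio.Condition)  # stand-in for the TTL dict
    key = "pr-url"  # hypothetical key

    original = conditions[key]   # a waiter would be holding this object
    del conditions[key]          # simulate TTL eviction during a long await
    recreated = conditions[key]  # the default factory builds a new Condition

    # notify() on `recreated` cannot wake tasks that wait on `original`.
    return original is recreated


same_condition = asyncio.run(demo())
```

Here `same_condition` is `False`: after eviction the next lookup silently returns a different object, so a task still waiting on the first one would not be woken.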




Remediation recommended

4. Cancelled wait leaks dedupe 🐞 Bug ☼ Reliability ⭐ New
Description
handle_push_trigger_for_new_commits() increments the per-PR counter and may await Condition.wait()
before entering the try/finally that decrements and removes the per-PR dedupe entries. If that wait
is cancelled/raises (only possible on the current_active_tasks==1 waiter path), the function exits
without cleanup and the per-PR dedupe state can remain stuck indefinitely because these dicts are
configured with ttl=None.
Code

pr_agent/servers/github_app.py[R181-198]

+    async with _pending_task_duplicate_push_conditions[api_url]:
+        current_active_tasks = _duplicate_push_triggers.setdefault(api_url, 0)
+        max_active_tasks = 2 if get_settings().github_app.push_trigger_pending_tasks_backlog else 1
+        if current_active_tasks >= max_active_tasks:
+            get_logger().info(
+                f"Skipping push trigger for {api_url=} because another event already triggered the same processing"
+            )
+            return {}
        # first task can enter, and second tasks too if backlog is enabled
        get_logger().info(
            f"Continue processing push trigger for {api_url=} because there are {current_active_tasks} active tasks"
        )
        _duplicate_push_triggers[api_url] += 1
-    else:
-        get_logger().info(
-            f"Skipping push trigger for {api_url=} because another event already triggered the same processing"
-        )
-        return {}
-    async with _pending_task_duplicate_push_conditions[api_url]:
        if current_active_tasks == 1:
-            # second task waits
+            # second task waits for the in-progress task to finish
            get_logger().info(
                f"Waiting to process push trigger for {api_url=} because the first task is still in progress"
            )
Evidence
The admission section can increment the counter and then await the condition before any try/finally
cleanup is established; cleanup (decrement/pop) only happens in the later finally. Since both dicts
are configured with ttl=None, leaked entries have no time-based eviction fallback.

pr_agent/servers/github_app.py[77-82]
pr_agent/servers/github_app.py[178-201]
pr_agent/servers/github_app.py[202-220]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`handle_push_trigger_for_new_commits()` increments `_duplicate_push_triggers[api_url]` and can suspend on `Condition.wait()` before entering the `try/finally` that decrements and cleans up. If the waiter is cancelled (or `wait()` raises) before reaching the `try/finally`, the per-PR counter/condition entry will not be decremented/removed.

Because these caches are now `ttl=None`, this leaked state can persist for the lifetime of the process and suppress/distort future push-trigger handling for that PR URL.

## Issue Context
The wait happens only for the second admitted task when backlog is enabled (`current_active_tasks == 1`). Cleanup logic currently lives in the later `finally` block.

## Fix Focus Areas
- pr_agent/servers/github_app.py[175-221]
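
A possible shape for the fix is to undo the increment whenever the wait itself fails, as in the sketch below. The `active` counter and `admit_and_wait` are hypothetical names, and the real function has more steps around this. The sketch relies on `asyncio.Condition.wait()` reacquiring the lock before it propagates a cancellation.

```python
import asyncio

active = {}  # hypothetical per-key counter of admitted tasks


async def admit_and_wait(key, cond):
    """Increment the counter, then wait; undo the increment if the wait fails."""
    async with cond:
        active[key] = active.get(key, 0) + 1
        try:
            await cond.wait()
        except BaseException:
            # Covers asyncio.CancelledError as well as ordinary errors.
            # The lock is held again at this point, so the update is safe.
            active[key] -= 1
            if active[key] <= 0:
                active.pop(key, None)
            raise
```

An alternative with the same effect would be to move the increment and the wait inside the existing `try/finally`, so that one cleanup path handles both normal completion and cancellation.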



5. Silently ignores KeyError 📘 Rule violation ☼ Reliability
Description
The new except KeyError: ... pass suppresses unexpected state errors without logging, which can
hide real bugs and make incidents harder to diagnose. This conflicts with the requirement for robust
error handling rather than ignoring errors.
Code

pr_agent/servers/github_app.py[R210-212]

Evidence
PR Compliance ID 3 disallows ignored errors. The added code catches KeyError and performs no
logging/handling (pass), which is an ignored error path in newly changed lines.

Rule 3: Robust Error Handling
pr_agent/servers/github_app.py[210-212]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`handle_push_trigger_for_new_commits()` catches `KeyError` and silently ignores it with `pass`, which can hide unexpected runtime issues.
## Issue Context
The `finally` cleanup block is intended to be resilient to TTL eviction, but completely suppressing the exception provides no signal when the `KeyError` is unexpected.
## Fix Focus Areas
- pr_agent/servers/github_app.py[205-215]
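
A minimal sketch of a non-silent variant follows. It uses the standard `logging` module rather than the project's own logger, and `decrement_active` is a hypothetical helper, not a function from the PR.

```python
import logging

logger = logging.getLogger(__name__)


def decrement_active(counters, key):
    """Decrement counters[key]; log and return False if the key is already gone."""
    try:
        counters[key] -= 1
    except KeyError:
        # Expected only if the entry was evicted or cleaned up elsewhere;
        # logging keeps the unexpected case visible instead of hiding it.
        logger.warning("No dedupe counter for %r during cleanup; skipping decrement", key)
        return False
    return True
```

The cleanup path stays tolerant of a missing key, but the event now leaves a trace that can be correlated with eviction or a concurrent cleanup.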




  • Author self-review: I have reviewed the code review findings, and addressed the relevant ones.

Previous review results

Review updated until commit 3604957

Results up to commit 215c9c3


🐞 Bugs (3) 📘 Rule violations (1) 📜 Skill insights (0)


Results up to commit 74ded19


🐞 Bugs (3) 📘 Rule violations (1) 📜 Skill insights (0)


Action required
1. Non-atomic dedupe cleanup 🐞 Bug ☼ Reliability
Description
handle_push_trigger_for_new_commits() decides to clean up shared per-PR state while holding the
Condition lock, but performs the actual .pop() outside that lock, allowing a concurrent push-trigger
task to start in between. This can delete the condition/counter for an in-flight task and corrupt
the dedupe bookkeeping (missed wait/notify, inconsistent counters).
Code

pr_agent/servers/github_app.py[R203-215]

+        # release the waiting task block, then clean up if no tasks remain
+        should_cleanup = False
+        try:
+            async with _pending_task_duplicate_push_conditions[api_url]:
+                _pending_task_duplicate_push_conditions[api_url].notify(1)
+                _duplicate_push_triggers[api_url] -= 1
+                should_cleanup = _duplicate_push_triggers[api_url] <= 0
+        except KeyError:
+            # TTL eviction already cleaned up this entry
+            pass
+        if should_cleanup:
+            _duplicate_push_triggers.pop(api_url, None)
+            _pending_task_duplicate_push_conditions.pop(api_url, None)
Evidence
The admission path increments _duplicate_push_triggers[api_url] before acquiring the condition
lock, while the new cleanup code removes the dict entries after releasing the condition
lock—creating a check-then-act race window where a concurrent task can start and then have its state
popped out from under it.

pr_agent/servers/github_app.py[175-195]
pr_agent/servers/github_app.py[202-215]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`handle_push_trigger_for_new_commits()` computes `should_cleanup` under the per-PR `asyncio.Condition` lock, but then removes `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` after releasing that lock. Because the increment/admission path mutates `_duplicate_push_triggers` before acquiring the condition lock, a new task can interleave between the decrement/check and the pop, causing live state to be removed.
### Issue Context
This code is implementing a concurrency/deduplication gate for push triggers per `api_url`. The counter and the condition must have a consistent lifecycle across all concurrent tasks.
### Fix Focus Areas
- pr_agent/servers/github_app.py[175-215]
### Suggested fix direction
- Ensure **all** mutations and lifecycle transitions for both `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` are performed under a single, consistent synchronization mechanism.
- One workable approach: fetch/create the per-PR condition first, then `async with condition:` to protect **both** admission (`setdefault`/increment) and cleanup (decrement + possible pop) for that `api_url`.
- Alternatively, introduce a dedicated lock (global or per-`api_url`) that guards: setdefault/increment, decrement/check, and pop, so no new task can start between the cleanup decision and the removal.
- If you keep the `pop()`, perform it inside the same critical section as the `should_cleanup` decision, and re-check the counter immediately before removing entries.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. pop bypasses TTL bookkeeping 🐞 Bug ☼ Reliability
Description
handle_push_trigger_for_new_commits() removes DefaultDictWithTimeout entries via .pop(), which
does not update DefaultDictWithTimeout’s internal __key_times map. After this PR makes
__refresh() run, a later refresh can try to del self[key] for a key that was already popped from
the dict, raising KeyError and potentially failing webhook handling.
Code

pr_agent/servers/github_app.py[R214-215]

Evidence
The cleanup uses .pop() on DefaultDictWithTimeout, but the class only synchronizes its internal
timestamp map on __delitem__. __refresh() deletes based on __key_times using del self[key],
which will raise if the dict entry was already removed via .pop() and __key_times wasn’t
updated.

pr_agent/servers/github_app.py[203-215]
pr_agent/servers/utils.py[63-72]
pr_agent/servers/utils.py[84-86]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`DefaultDictWithTimeout` maintains a private `__key_times` map that is only kept in sync when deletion goes through `__delitem__`. The new cleanup uses `.pop()`, which bypasses `__delitem__`, leaving stale timestamps behind.
Once `__refresh()` runs (and this PR makes it run at the correct cadence), it will build `to_delete` from `__key_times` and execute `del self[key]`. If the dict entry was previously removed via `.pop()`, `super().__delitem__(key)` raises `KeyError`, propagating out of `__refresh()`/`__getitem__()`.
## Issue Context
This affects both `_duplicate_push_triggers` and `_pending_task_duplicate_push_conditions`, which are `DefaultDictWithTimeout` instances.
## Fix Focus Areas
- pr_agent/servers/github_app.py[203-215]
- pr_agent/servers/utils.py[63-72]
- pr_agent/servers/utils.py[84-86]
## Suggested fix
Prefer one of:
1) **Don’t use `.pop()` on `DefaultDictWithTimeout`**; delete via `del` so `__delitem__` runs, e.g.:

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. TTL evicts in-flight state 🐞 Bug ≡ Correctness
Description
Now that DefaultDictWithTimeout.__refresh() actually runs, it can delete the asyncio.Condition and
counter entries used by handle_push_trigger_for_new_commits while the first task is still
running/waiting, if processing time exceeds the configured TTL. That can cause a waiting task to
never be notified (notify happens on a newly-created Condition) and/or raise KeyError when
decrementing _duplicate_push_triggers in the finally block.
Code

pr_agent/servers/utils.py[R67-68]

Evidence
DefaultDictWithTimeout will now perform periodic TTL-based deletion (every refresh_interval) of any
key whose last recorded access time is older than ttl. The GitHub App uses this structure to store
per-PR-url synchronization primitives and a running-task counter, but it does not touch these dict
keys during the long await of PR processing, so another webhook can trigger a refresh and evict the
key while it is still needed; default TTL is 300 seconds and the e2e test explicitly allows ~5
minutes for processing, making this eviction plausible in normal operation.

pr_agent/servers/utils.py[63-72]
pr_agent/servers/github_app.py[175-206]
pr_agent/settings/configuration.toml[234-255]
tests/e2e_tests/test_github_app.py[23-29]
pr_agent/tools/pr_reviewer.py[152-156]
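The eviction scenario described above can be reproduced in isolation. The following is a minimal sketch, assuming a simplified TTL dictionary: `TTLDefaultDict`, its `clock` parameter, and `_evict_expired` are hypothetical stand-ins for illustration, not the project's `DefaultDictWithTimeout`. An injectable clock replaces real waiting.

```python
import asyncio
from collections import defaultdict


class TTLDefaultDict(defaultdict):
    """Hypothetical stand-in for a TTL-evicting defaultdict (not the project's class)."""

    def __init__(self, default_factory, ttl, clock):
        super().__init__(default_factory)
        self._ttl = ttl
        self._clock = clock      # injectable clock so the example needs no sleeping
        self._key_times = {}

    def _evict_expired(self):
        now = self._clock()
        expired = [k for k, t in self._key_times.items() if now - t > self._ttl]
        for k in expired:
            self._key_times.pop(k, None)
            super().pop(k, None)

    def __getitem__(self, key):
        # Record the access time for this key, then sweep every expired key.
        self._key_times[key] = self._clock()
        self._evict_expired()
        return super().__getitem__(key)


now = [0.0]
conditions = TTLDefaultDict(asyncio.Condition, ttl=300, clock=lambda: now[0])

first = conditions["pr-1"]    # the first task grabs the condition, then awaits long work
now[0] = 301.0                # processing outlasts the TTL without touching the key
_ = conditions["pr-2"]        # an unrelated webhook triggers the sweep
second = conditions["pr-1"]   # the key was evicted, so the factory builds a new object

evicted_in_flight = first is not second
```

In this sketch `evicted_in_flight` ends up true: a task still waiting on `first` would never see a `notify()` issued on `second`.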

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`DefaultDictWithTimeout` eviction is now active and can delete `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` while `handle_push_trigger_for_new_commits` is still processing. This can break synchronization: a waiter may block forever if `notify()` runs on a newly-created `Condition`, and `_duplicate_push_triggers[api_url] -= 1` can raise if the key was evicted.
## Issue Context
- The dicts are accessed, then the code awaits long-running PR processing (`agent.handle_request` via `_perform_auto_commands_github`) without touching those keys.
- TTL default is 300s, and the e2e GitHub App test allows ~5 minutes processing.
## Fix Focus Areas
- pr_agent/servers/github_app.py[175-206]
- pr_agent/servers/utils.py[63-78]
- pr_agent/settings/configuration.toml[234-252]
## Implementation notes
- Prefer explicit lifecycle cleanup: in the `finally` block, after decrementing, if the counter becomes `0`, delete both dict entries for `api_url` (guard with `try/except KeyError`).
- If you keep TTL, ensure it cannot evict in-flight state: either set TTL to a safe upper bound relative to worst-case processing time or periodically refresh/touch the key during processing.
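
As a rough illustration of the first implementation note, the sketch below deletes both entries once the counter reaches zero. The names `active_tasks`, `conditions`, and `process` are hypothetical, and plain `defaultdict` objects stand in for the project's TTL-backed containers.

```python
import asyncio
from collections import defaultdict

# Plain dicts stand in for the project's TTL-backed containers (assumption).
active_tasks = defaultdict(int)
conditions = defaultdict(asyncio.Condition)


async def process(api_url, work):
    """Run `work` for one PR URL and drop its bookkeeping once no task remains."""
    active_tasks[api_url] += 1
    try:
        return await work()
    finally:
        active_tasks[api_url] -= 1
        if active_tasks[api_url] <= 0:
            # Delete both entries together; tolerate a key that is already gone.
            for table in (active_tasks, conditions):
                try:
                    del table[api_url]
                except KeyError:
                    pass


async def main():
    async def work():
        _ = conditions["pr-1"]      # touch the condition so an entry exists
        await asyncio.sleep(0)
        return "done"

    result = await process("pr-1", work)
    return result, dict(active_tasks), list(conditions)


result, remaining_counts, remaining_conditions = asyncio.run(main())
```

The `finally` block contains no `await`, so within a single event loop no other task can run between the decrement and the deletion in this sketch.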




Remediation recommended
4. Silently ignores KeyError 📘 Rule violation ☼ Reliability
Description
The new except KeyError: ... pass suppresses unexpected state errors without logging, which can
hide real bugs and make incidents harder to diagnose. This conflicts with the requirement for robust
error handling rather than ignoring errors.
Code

pr_agent/servers/github_app.py[R210-212]

Evidence
PR Compliance ID 3 disallows ignored errors. The added code catches KeyError and performs no
logging/handling (pass), which is an ignored error path in newly changed lines.

Rule 3: Robust Error Handling
pr_agent/servers/github_app.py[210-212]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`handle_push_trigger_for_new_commits()` catches `KeyError` and silently ignores it with `pass`, which can hide unexpected runtime issues.
## Issue Context
The `finally` cleanup block is intended to be resilient to TTL eviction, but completely suppressing the exception provides no signal when the `KeyError` is unexpected.
## Fix Focus Areas
- pr_agent/servers/github_app.py[205-215]
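
One possible shape for a non-silent handler is sketched below. It uses the standard-library `logging` module and a hypothetical helper `decrement_counter`; the project's own logger and function layout may differ.

```python
import logging

logger = logging.getLogger("dedupe_example")  # hypothetical logger name


def decrement_counter(counters, api_url):
    """Decrement the counter for api_url; log instead of silently passing if it is gone."""
    try:
        counters[api_url] -= 1
    except KeyError:
        logger.warning(
            "Counter for %s was missing during cleanup (possibly TTL-evicted)", api_url
        )
        return False
    return True


counters = {"pr-1": 1}
found = decrement_counter(counters, "pr-1")     # key exists: decremented to 0
missing = decrement_counter(counters, "pr-2")   # key absent: warning logged, no exception
```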




Comment thread pr_agent/servers/utils.py
@karesansui-u

Contributor Author

Good catch. You're right that once TTL eviction starts working correctly, it can evict entries while handle_push_trigger_for_new_commits is still processing — particularly if PR processing exceeds the 300s default TTL.

Two options:

  1. Explicit lifecycle cleanup (preferred): In the finally block of handle_push_trigger_for_new_commits, after decrementing the counter, delete both dict entries when the counter reaches 0. Guard with try/except KeyError for the case where TTL already cleaned up. This makes the code self-cleaning regardless of TTL.

  2. Increase TTL: Set the TTL high enough to cover worst-case processing time. Less robust but simpler.

I can add option 1 to this PR if that's the preferred direction. The fix to the guard condition itself is still correct — the current code never evicts anything.

@qodo-free-for-open-source-projects

Contributor

Code Review by Qodo


Looking for bugs?

Check back in a few minutes. An AI review agent is analyzing this pull request.



@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 25, 2026

Contributor

Persistent review updated to latest commit 4dc7874

Comment thread pr_agent/servers/github_app.py Outdated
Comment on lines +214 to +215
_duplicate_push_triggers.pop(api_url, None)
_pending_task_duplicate_push_conditions.pop(api_url, None)

Contributor


Action required

1. Pop bypasses ttl bookkeeping 🐞 Bug ☼ Reliability

handle_push_trigger_for_new_commits() removes DefaultDictWithTimeout entries via .pop(), which
does not update DefaultDictWithTimeout’s internal __key_times map. After this PR makes
__refresh() run, a later refresh can try to del self[key] for a key that was already popped from
the dict, raising KeyError and potentially failing webhook handling.
Agent Prompt
## Issue description
`DefaultDictWithTimeout` maintains a private `__key_times` map that is only kept in sync when deletion goes through `__delitem__`. The new cleanup uses `.pop()`, which bypasses `__delitem__`, leaving stale timestamps behind.

Once `__refresh()` runs (and this PR makes it run at the correct cadence), it will build `to_delete` from `__key_times` and execute `del self[key]`. If the dict entry was previously removed via `.pop()`, `super().__delitem__(key)` raises `KeyError`, propagating out of `__refresh()`/`__getitem__()`.

## Issue Context
This affects both `_duplicate_push_triggers` and `_pending_task_duplicate_push_conditions`, which are `DefaultDictWithTimeout` instances.

## Fix Focus Areas
- pr_agent/servers/github_app.py[203-215]
- pr_agent/servers/utils.py[63-72]
- pr_agent/servers/utils.py[84-86]

## Suggested fix
Prefer one of:
1) **Don’t use `.pop()` on `DefaultDictWithTimeout`**; delete via `del` so `__delitem__` runs, e.g.:
```python
try:
    del _duplicate_push_triggers[api_url]
except KeyError:
    pass
try:
    del _pending_task_duplicate_push_conditions[api_url]
except KeyError:
    pass
```

2) **Implement `pop()` (and possibly `clear()`/`popitem()`) on `DefaultDictWithTimeout`** to also remove `__key_times` entries (`self.__key_times.pop(key, None)`) before delegating to `super().pop(...)`.
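
A minimal sketch of option 2, under stated assumptions: `TTLDefaultDict`, `_key_times`, and `sweep` are hypothetical names for illustration, and the real class may structure its refresh differently. The point is only that `pop()` drops the timestamp before delegating to `dict.pop`.

```python
import time
from collections import defaultdict


class TTLDefaultDict(defaultdict):
    """Hypothetical sketch, not the project's DefaultDictWithTimeout."""

    _MISSING = object()

    def __init__(self, default_factory=None, ttl=None):
        super().__init__(default_factory)
        self._ttl = ttl
        self._key_times = {}

    def __getitem__(self, key):
        self._key_times[key] = time.monotonic()
        return super().__getitem__(key)

    def __delitem__(self, key):
        self._key_times.pop(key, None)
        super().__delitem__(key)

    def pop(self, key, default=_MISSING):
        # Drop the timestamp first so a later sweep never sees a stale key.
        self._key_times.pop(key, None)
        if default is TTLDefaultDict._MISSING:
            return super().pop(key)
        return super().pop(key, default)

    def sweep(self, now=None):
        """Delete every key whose timestamp is older than ttl."""
        if self._ttl is None:
            return
        now = time.monotonic() if now is None else now
        for key in [k for k, t in self._key_times.items() if now - t > self._ttl]:
            del self[key]


d = TTLDefaultDict(int, ttl=10)
d["pr-1"] += 1                        # creates the entry and records a timestamp
popped = d.pop("pr-1", None)          # removes the value and its timestamp together
d.sweep(now=time.monotonic() + 60)    # would raise KeyError if the timestamp were stale
```

Without the `pop()` override, the built-in `dict.pop` does not go through `__delitem__`, so the timestamp would survive and the final `sweep` call would hit `del self[key]` on a missing key.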


@naorpeled

Member

Hey @karesansui-u,
Thanks for opening this!

Part of the issue was already resolved by another commit. Can you please rebase and add tests?

When the active task counter reaches 0, delete both dict entries
immediately so TTL eviction cannot race with in-flight processing.
Guard with try/except KeyError for the case where TTL already evicted.
@naorpeled naorpeled force-pushed the fix/ttl-refresh-guard-inversion branch from 4dc7874 to fd0a503 Compare July 3, 2026 09:40
@github-actions github-actions Bot added the bug label Jul 3, 2026
Comment thread pr_agent/servers/github_app.py Outdated
Comment on lines +203 to +215
    # release the waiting task block, then clean up if no tasks remain
    should_cleanup = False
    try:
        async with _pending_task_duplicate_push_conditions[api_url]:
            _pending_task_duplicate_push_conditions[api_url].notify(1)
            _duplicate_push_triggers[api_url] -= 1
            should_cleanup = _duplicate_push_triggers[api_url] <= 0
    except KeyError:
        # TTL eviction already cleaned up this entry
        pass
    if should_cleanup:
        _duplicate_push_triggers.pop(api_url, None)
        _pending_task_duplicate_push_conditions.pop(api_url, None)

Contributor


Action required

1. Non-atomic dedupe cleanup 🐞 Bug ☼ Reliability

handle_push_trigger_for_new_commits() decides to clean up shared per-PR state while holding the
Condition lock, but performs the actual .pop() outside that lock, allowing a concurrent push-trigger
task to start in between. This can delete the condition/counter for an in-flight task and corrupt
the dedupe bookkeeping (missed wait/notify, inconsistent counters).
Agent Prompt
### Issue description
`handle_push_trigger_for_new_commits()` computes `should_cleanup` under the per-PR `asyncio.Condition` lock, but then removes `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` after releasing that lock. Because the increment/admission path mutates `_duplicate_push_triggers` before acquiring the condition lock, a new task can interleave between the decrement/check and the pop, causing live state to be removed.

### Issue Context
This code is implementing a concurrency/deduplication gate for push triggers per `api_url`. The counter and the condition must have a consistent lifecycle across all concurrent tasks.

### Fix Focus Areas
- pr_agent/servers/github_app.py[175-215]

### Suggested fix direction
- Ensure **all** mutations and lifecycle transitions for both `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` are performed under a single, consistent synchronization mechanism.
  - One workable approach: fetch/create the per-PR condition first, then `async with condition:` to protect **both** admission (`setdefault`/increment) and cleanup (decrement + possible pop) for that `api_url`.
  - Alternatively, introduce a dedicated lock (global or per-`api_url`) that guards: setdefault/increment, decrement/check, and pop, so no new task can start between the cleanup decision and the removal.
- If you keep the `pop()`, perform it inside the same critical section as the `should_cleanup` decision, and re-check the counter immediately before removing entries.
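
The first approach could look roughly like the sketch below, in which admission and cleanup both run under the same per-URL `asyncio.Condition`. The names `run_deduped`, `_conditions`, `_active`, and `max_active` are hypothetical, plain dicts replace the TTL containers, and cancellation while waiting is deliberately not handled.

```python
import asyncio

_conditions = {}   # api_url -> asyncio.Condition (hypothetical module state)
_active = {}       # api_url -> number of admitted tasks


async def run_deduped(api_url, work, max_active=2):
    """Admit up to max_active tasks per URL, run them one at a time, clean up under the lock."""
    cond = _conditions.setdefault(api_url, asyncio.Condition())
    async with cond:
        count = _active.get(api_url, 0)
        if count >= max_active:
            return None                  # skipped: backlog already full
        _active[api_url] = count + 1
        if count >= 1:
            await cond.wait()            # backlog task waits for the running one
    try:
        return await work()
    finally:
        async with cond:
            _active[api_url] -= 1
            if _active[api_url] <= 0:
                # The decision and the removal share one critical section.
                _active.pop(api_url, None)
                _conditions.pop(api_url, None)
            else:
                cond.notify(1)


async def main():
    order = []

    async def work(name):
        order.append(f"start {name}")
        await asyncio.sleep(0.01)
        order.append(f"end {name}")
        return name

    results = await asyncio.gather(
        run_deduped("pr-1", lambda: work("a")),
        run_deduped("pr-1", lambda: work("b")),
        run_deduped("pr-1", lambda: work("c")),
    )
    return results, order, dict(_active), dict(_conditions)


results, order, active_after, conditions_after = asyncio.run(main())
```

In this run the first task executes, the second waits and then executes, the third is skipped, and both dictionaries are empty once the last admitted task drains.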


@qodo-free-for-open-source-projects

Contributor

Code review by qodo was updated up to the latest commit fd0a503

Update the completion test to assert the dedupe entries are removed (the new
self-cleaning contract) and add coverage for the two new branches:
- cleanup is skipped while another task is still active
- a TTL-evicted counter is tolerated in the finally block (KeyError guard)
@qodo-free-for-open-source-projects

Contributor

Code review by qodo was updated up to the latest commit e10a438

Resolve CodeQL py/mixed-returns on handle_push_trigger_for_new_commits by
returning {} explicitly on the success path, matching every other return in
the function instead of falling through to an implicit None.
@qodo-free-for-open-source-projects

Contributor

Code review by qodo was updated up to the latest commit 53da118

- DefaultDictWithTimeout.pop() now keeps the internal __key_times map in
  sync. The finally block cleaned up via .pop(), which bypassed __delitem__
  and left a stale timestamp; a later __refresh() would then run
  'del self[key]' for an already-removed key and raise KeyError.
- Move the decrement, the no-tasks-left decision, and the entry removal all
  under the same per-PR condition lock so a newly admitted task cannot
  interleave between the decision and the removal.
- Log the KeyError from a TTL-evicted counter instead of silently passing,
  and leave any stray condition entry for the TTL sweep in that case.
- Add unit tests for pop() bookkeeping (incl. a regression for the
  refresh-after-pop KeyError).
Comment thread tests/unittest/test_github_app_timeout_core.py Fixed
Comment thread tests/unittest/test_github_app_timeout_core.py Fixed
@qodo-free-for-open-source-projects

Contributor

Code review by qodo was updated up to the latest commit 74ded19

CodeQL py/side-effect-in-assert: an assert expression is skipped under
python -O, so the mutating pop() must run on its own line.
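
A small illustration of the pattern that commit message describes, using a hypothetical `counters` dict rather than the project's test code:

```python
counters = {"pr-1": 1}

# Fragile: under `python -O` the whole assert statement is stripped, so pop() never runs.
# assert counters.pop("pr-1") == 1

# Safer: perform the mutation on its own line, then assert on the result.
popped = counters.pop("pr-1")
assert popped == 1
assert "pr-1" not in counters
```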
@qodo-free-for-open-source-projects

Contributor

Code review by qodo was updated up to the latest commit 215c9c3

…ction

Addresses the remaining review feedback on the push-trigger dedupe:

- Non-atomic cleanup: admission (setdefault/increment/wait) now runs under the
  same per-PR condition lock as the finally cleanup, so the counter and
  condition have a single consistent lifecycle and no task can interleave
  between the cleanup decision and the removal.
- TTL evicts in-flight state: switch both dicts to ttl=None. The deterministic
  finally cleanup already bounds memory (entries are removed once no task
  remains), and disabling eviction removes the risk of deleting a condition/
  counter that an in-flight or waiting task still needs. Drop the now-unused
  push_trigger_pending_tasks_ttl config.
- With eviction gone and admission locked, the counter always exists during
  cleanup, so the defensive try/except KeyError (the last silently-handled
  error) is removed.

Replace the obsolete TTL-eviction test with a concurrent backlog test that
exercises the wait/notify path and asserts the shared state is cleaned up only
after both tasks drain.
@qodo-free-for-open-source-projects

Contributor

Code review by qodo was updated up to the latest commit 3604957

@naorpeled naorpeled closed this Jul 4, 2026
@naorpeled

Member

Closing this, as the main issue was fixed by another PR. I tried to fix things further, but it feels like the change that landed was good enough.

I'm open to revisiting this in the future.
Thanks for your work on this!
