fix: invert TTL refresh guard in DefaultDictWithTimeout (memory leak)#2255
fix: invert TTL refresh guard in DefaultDictWithTimeout (memory leak)#2255karesansui-u wants to merge 6 commits into
Conversation
Review Summary by QodoFix inverted TTL refresh guard causing memory leak
WalkthroughsDescription• Inverts TTL refresh guard condition from > to < • Fixes memory leak in DefaultDictWithTimeout expiration logic • Prevents unbounded growth of duplicate push trigger caches • Ensures expired keys are properly cleaned up periodically Diagramflowchart LR
A["Guard Condition<br/>request_time - last_refresh"] --> B{"Comparison<br/>Operator"}
B -->|Before: >| C["Returns early<br/>when refresh needed"]
B -->|After: <| D["Returns early<br/>when refresh not needed"]
C --> E["Expired keys<br/>never deleted"]
D --> F["Expired keys<br/>properly cleaned"]
E --> G["Memory Leak"]
F --> H["Fixed"]
File Changes1. pr_agent/servers/utils.py
|
Code Review by Qodo
1. Non-atomic dedupe cleanup
|
|
Good catch. You're right that once TTL eviction starts working correctly, it can evict entries while Two options:
I can add option 1 to this PR if that's the preferred direction. The fix to the guard condition itself is still correct — the current code never evicts anything. |
|
Persistent review updated to latest commit 4dc7874 |
| _duplicate_push_triggers.pop(api_url, None) | ||
| _pending_task_duplicate_push_conditions.pop(api_url, None) |
There was a problem hiding this comment.
1. Pop bypasses ttl bookkeeping 🐞 Bug ☼ Reliability
handle_push_trigger_for_new_commits() removes DefaultDictWithTimeout entries via .pop(), which does not update DefaultDictWithTimeout’s internal __key_times map. After this PR makes __refresh() run, a later refresh can try to del self[key] for a key that was already popped from the dict, raising KeyError and potentially failing webhook handling.
Agent Prompt
## Issue description
`DefaultDictWithTimeout` maintains a private `__key_times` map that is only kept in sync when deletion goes through `__delitem__`. The new cleanup uses `.pop()`, which bypasses `__delitem__`, leaving stale timestamps behind.
Once `__refresh()` runs (and this PR makes it run at the correct cadence), it will build `to_delete` from `__key_times` and execute `del self[key]`. If the dict entry was previously removed via `.pop()`, `super().__delitem__(key)` raises `KeyError`, propagating out of `__refresh()`/`__getitem__()`.
## Issue Context
This affects both `_duplicate_push_triggers` and `_pending_task_duplicate_push_conditions`, which are `DefaultDictWithTimeout` instances.
## Fix Focus Areas
- pr_agent/servers/github_app.py[203-215]
- pr_agent/servers/utils.py[63-72]
- pr_agent/servers/utils.py[84-86]
## Suggested fix
Prefer one of:
1) **Don’t use `.pop()` on `DefaultDictWithTimeout`**; delete via `del` so `__delitem__` runs, e.g.:
```python
try:
del _duplicate_push_triggers[api_url]
except KeyError:
pass
try:
del _pending_task_duplicate_push_conditions[api_url]
except KeyError:
pass
```
2) **Implement `pop()` (and possibly `clear()`/`popitem()`) on `DefaultDictWithTimeout`** to also remove `__key_times` entries (`self.__key_times.pop(key, None)`) before delegating to `super().pop(...)`.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
|
Hey @karesansui-u, Some of the issue was already resolved by another commit, can you please rebase and add tests? |
When the active task counter reaches 0, delete both dict entries immediately so TTL eviction cannot race with in-flight processing. Guard with try/except KeyError for the case where TTL already evicted.
4dc7874 to
fd0a503
Compare
| # release the waiting task block, then clean up if no tasks remain | ||
| should_cleanup = False | ||
| try: | ||
| async with _pending_task_duplicate_push_conditions[api_url]: | ||
| _pending_task_duplicate_push_conditions[api_url].notify(1) | ||
| _duplicate_push_triggers[api_url] -= 1 | ||
| should_cleanup = _duplicate_push_triggers[api_url] <= 0 | ||
| except KeyError: | ||
| # TTL eviction already cleaned up this entry | ||
| pass | ||
| if should_cleanup: | ||
| _duplicate_push_triggers.pop(api_url, None) | ||
| _pending_task_duplicate_push_conditions.pop(api_url, None) |
There was a problem hiding this comment.
1. Non-atomic dedupe cleanup 🐞 Bug ☼ Reliability
handle_push_trigger_for_new_commits() decides to clean up shared per-PR state while holding the Condition lock, but performs the actual .pop() outside that lock, allowing a concurrent push-trigger task to start in between. This can delete the condition/counter for an in-flight task and corrupt the dedupe bookkeeping (missed wait/notify, inconsistent counters).
Agent Prompt
### Issue description
`handle_push_trigger_for_new_commits()` computes `should_cleanup` under the per-PR `asyncio.Condition` lock, but then removes `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` after releasing that lock. Because the increment/admission path mutates `_duplicate_push_triggers` before acquiring the condition lock, a new task can interleave between the decrement/check and the pop, causing live state to be removed.
### Issue Context
This code is implementing a concurrency/deduplication gate for push triggers per `api_url`. The counter and the condition must have a consistent lifecycle across all concurrent tasks.
### Fix Focus Areas
- pr_agent/servers/github_app.py[175-215]
### Suggested fix direction
- Ensure **all** mutations and lifecycle transitions for both `_duplicate_push_triggers[api_url]` and `_pending_task_duplicate_push_conditions[api_url]` are performed under a single, consistent synchronization mechanism.
- One workable approach: fetch/create the per-PR condition first, then `async with condition:` to protect **both** admission (`setdefault`/increment) and cleanup (decrement + possible pop) for that `api_url`.
- Alternatively, introduce a dedicated lock (global or per-`api_url`) that guards: setdefault/increment, decrement/check, and pop, so no new task can start between the cleanup decision and the removal.
- If you keep the `pop()`, perform it inside the same critical section as the `should_cleanup` decision, and re-check the counter immediately before removing entries.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
|
Code review by qodo was updated up to the latest commit fd0a503 |
Update the completion test to assert the dedupe entries are removed (the new self-cleaning contract) and add coverage for the two new branches: - cleanup is skipped while another task is still active - a TTL-evicted counter is tolerated in the finally block (KeyError guard)
|
Code review by qodo was updated up to the latest commit e10a438 |
Resolve CodeQL py/mixed-returns on handle_push_trigger_for_new_commits by
returning {} explicitly on the success path, matching every other return in
the function instead of falling through to an implicit None.
|
Code review by qodo was updated up to the latest commit 53da118 |
- DefaultDictWithTimeout.pop() now keeps the internal __key_times map in sync. The finally block cleaned up via .pop(), which bypassed __delitem__ and left a stale timestamp; a later __refresh() would then run 'del self[key]' for an already-removed key and raise KeyError. - Move the decrement, the no-tasks-left decision, and the entry removal all under the same per-PR condition lock so a newly admitted task cannot interleave between the decision and the removal. - Log the KeyError from a TTL-evicted counter instead of silently passing, and leave any stray condition entry for the TTL sweep in that case. - Add unit tests for pop() bookkeeping (incl. a regression for the refresh-after-pop KeyError).
|
Code review by qodo was updated up to the latest commit 74ded19 |
CodeQL py/side-effect-in-assert: an assert expression is skipped under python -O, so the mutating pop() must run on its own line.
|
Code review by qodo was updated up to the latest commit 215c9c3 |
…ction Addresses the remaining review feedback on the push-trigger dedupe: - Non-atomic cleanup: admission (setdefault/increment/wait) now runs under the same per-PR condition lock as the finally cleanup, so the counter and condition have a single consistent lifecycle and no task can interleave between the cleanup decision and the removal. - TTL evicts in-flight state: switch both dicts to ttl=None. The deterministic finally cleanup already bounds memory (entries are removed once no task remains), and disabling eviction removes the risk of deleting a condition/ counter that an in-flight or waiting task still needs. Drop the now-unused push_trigger_pending_tasks_ttl config. - With eviction gone and admission locked, the counter always exists during cleanup, so the defensive try/except KeyError (the last silently-handled error) is removed. Replace the obsolete TTL-eviction test with a concurrent backlog test that exercises the wait/notify path and asserts the shared state is cleaned up only after both tasks drain.
|
Code review by qodo was updated up to the latest commit 3604957 |
|
Closing this as the main issue was fixed by another PR, I tried to further fix stuff but it feels like the change that landed was good enough. I'm open to re-revisiting this in the future. |

Bug description
DefaultDictWithTimeout.__refresh()has an inverted guard condition. The comparison>should be<:This causes expired keys to never be cleaned up. The
to_deleteloop only runs whenrequest_time - __last_refreshis small (no time passed = nothing to delete), and skips when enough time has passed (when cleanup is actually needed).Impact
_duplicate_push_triggersand_pending_task_duplicate_push_conditionsin the GitHub App server grow without bound. Every processed PR URL stays in memory forever. On long-running server instances, this is a memory leak.Fix
Change
>to<on line 67 ofpr_agent/servers/utils.py.Affected files
pr_agent/servers/utils.py(L67) — 1 character change