[FLINK-40068][tests] Do not assert apply() was called when cancelling KeyedJob window#28634
Open
MartijnVisser wants to merge 1 commit into
Open
[FLINK-40068][tests] Do not assert apply() was called when cancelling KeyedJob window#28634MartijnVisser wants to merge 1 commit into
MartijnVisser wants to merge 1 commit into
Conversation
… KeyedJob window StatefulWindowFunction.close() also runs on the cancellation path, and the GENERATE/MIGRATE jobs are always stopped via non-draining cancel-with-savepoint, so a window subtask can be closed before apply() was called; the failing assertion then kills the shared MiniCluster's only TaskManager and starves every later parameterization. Restrict the assertion to RESTORE, the only mode whose job runs to completion. Generated-by: Claude Opus 4.8 (1M context)
Collaborator
spuru9
approved these changes
Jul 4, 2026
spuru9
left a comment
Contributor
There was a problem hiding this comment.
LGTM.
Clean and straightforward fix.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is the purpose of the change
Fixes the recurring
KeyedComplexChainTest.testMigrationAndRestorefailures (NoResourceAvailableException, Azure builds 76400, 76448, 76627), the follow-up investigation deferred in FLINK-39918.KeyedJob.StatefulWindowFunction.close()assertedapplyCalledon every close, butclose()also runs on the cancellation path, and the GENERATE/MIGRATE jobs are always stopped via non-draining cancel-with-savepoint, so under CI load a window subtask can be closed before its element was processed. TheAssertionFailedErrorthrown fromclose()fails the shared static MiniCluster's only TaskManager (NUM_TMS=1, no restart), so the subsequent restore step and every later savepoint parameterization starve withNoResourceAvailableExceptionafter the slot-request timeout. There is no slot leak and no data loss: the runtime released all slots correctly, and the window's keyed state is restored from the savepoint independently ofapply()having run.Brief change log
StatefulWindowFunction.close()'sapplyCalledassertion toExecutionMode.RESTORE, the only mode whose job runs to completion.Verifying this change
This change is already covered by existing tests:
KeyedComplexChainTestran 3x locally, 16/16 green each time. The original failure needs cancel-with-savepoint to beat element delivery under CI load and is not reproducible locally.Coverage trade-off: MIGRATE loses the
applyCalledfail-safe, but the RESTORE run re-validates the migrated state end-to-end via the untouchedapply()state-comparison assertions, so migration correctness is still verified.Does this pull request potentially affect one of the following parts:
@Public(Evolving): noDocumentation
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Opus 4.8 (1M context)