[FLINK-40069][tests] Stabilize OpenTelemetryMetricReporterProtocolTest against collector startup race #28635
Open
MartijnVisser wants to merge 2 commits into
…tainer
The default HostPortWaitStrategy reports the shell-less collector image as ready before its OTLP receiver accepts connections, so the test's first export could fail against a not-yet-listening receiver. Wait for the collector's readiness log line, which is only emitted after all components including the receivers have started, with a bounded startup timeout.
Generated-by: Claude Opus 4.8 (1M context)
The metric protocol test exported exactly once before polling the collector output file, so any single failed export (racing collector startup, or a transient HTTP 404) left the file empty and doomed the whole retry budget. Re-invoke report() inside each eventually() iteration via a pre-attempt hook; the assertion reads only the last line, so repeated exports are safe.
Generated-by: Claude Opus 4.8 (1M context)
Collaborator
spuru9 reviewed Jul 4, 2026
    }

    @Override
    protected void setupAndReport(MetricConfig config) {
Contributor
protected void setup(MetricConfig config) {
We no longer report in this function, so should it be renamed? The calling function would need to change as well.
What is the purpose of the change
Fixes intermittent failures of OpenTelemetryMetricReporterProtocolTest.testGzipCompressionGrpc/Http (MismatchedInputException: No content to map due to end-of-input after exhausting the full 2-minute retry budget; Azure build 76609, leg test_cron_jdk21_connect). Two issues combined: (1) OtelTestContainer used the default HostPortWaitStrategy, which false-reports readiness for the shell-less collector image (its internal exec check cannot run), so the reporter's export, invoked ~160ms before the collector logged readiness, began against a not-yet-listening OTLP receiver and failed; (2) the test exported exactly once before polling, so that single failed export left the collector output file empty for the entire timeout. Gzip is incidental: those two tests simply ran first against the cold container.
Brief change log
OtelTestContainer: wait for the collector's "Everything is ready. Begin running and processing data." log line (emitted only after all components including the OTLP receivers have started, unlike the health_check extension which can report healthy independently of receiver readiness) with a bounded 1-minute startup timeout.
OpenTelemetryTestBase: add an eventuallyConsumeJson overload with a pre-attempt hook; the existing single-arg method delegates with a no-op.
OpenTelemetryMetricReporterProtocolTest: re-export (report() + waitForLastReportToComplete()) inside the retry loop rather than once beforehand; this also covers transient export failures such as an observed one-off HTTP 404.
Verifying this change
This change is already covered by existing tests. Verified locally with Docker: the three protocol classes (metrics/events/traces) and both ITCases pass twice in a row (29 tests, 0 failures), with the metrics protocol class completing in seconds versus the 256s two-error CI failure.
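For readers unfamiliar with the pattern, the re-export-inside-the-retry-loop idea from the change log can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in this PR: the names RetryWithPreAttemptHook, eventually, and FlakyCollector are hypothetical, and the collector is simulated with an in-memory list that drops the first few exports instead of a real container and output file.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of a retry loop with a pre-attempt hook. */
final class RetryWithPreAttemptHook {

    /** Runs beforeAttempt, then attempt, until attempt succeeds or the timeout expires. */
    static <T> T eventually(
            Runnable beforeAttempt, Supplier<T> attempt, Duration timeout, Duration pause)
            throws InterruptedException {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                beforeAttempt.run(); // e.g. re-export before every check
                return attempt.get();
            } catch (RuntimeException e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // budget exhausted: surface the last failure
                }
            }
            Thread.sleep(pause.toMillis());
        }
    }

    /** Single-shot variant: delegates with a no-op hook, so nothing is re-sent. */
    static <T> T eventually(Supplier<T> attempt, Duration timeout, Duration pause)
            throws InterruptedException {
        return eventually(() -> {}, attempt, timeout, pause);
    }

    /** Simulated collector that silently drops its first few exports (assumption). */
    static final class FlakyCollector {
        private final List<String> outputLines = new ArrayList<>();
        private int remainingFailures;
        private int exportCount;

        FlakyCollector(int failures) {
            this.remainingFailures = failures;
        }

        void export(String payload) {
            exportCount++;
            if (remainingFailures > 0) {
                remainingFailures--; // receiver not listening yet: payload is lost
                return;
            }
            outputLines.add(payload);
        }

        /** Reads only the last line, so repeated exports do not change the result. */
        String lastLine() {
            if (outputLines.isEmpty()) {
                throw new IllegalStateException("No content: collector output is empty");
            }
            return outputLines.get(outputLines.size() - 1);
        }

        int exportCount() {
            return exportCount;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FlakyCollector collector = new FlakyCollector(2);
        String line = eventually(
                () -> collector.export("metrics"),
                collector::lastLine,
                Duration.ofSeconds(5),
                Duration.ofMillis(10));
        // prints "metrics after 3 exports"
        System.out.println(line + " after " + collector.exportCount() + " exports");
    }
}
```

With the no-op overload, a single export that lands before the receiver is listening is never repeated, so the check keeps failing until the timeout; with the hook, the third attempt succeeds.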
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Opus 4.8 (1M context)