Skip to content

[FLINK-40069][tests] Stabilize OpenTelemetryMetricReporterProtocolTest against collector startup race#28635

Open
MartijnVisser wants to merge 2 commits into
apache:masterfrom
MartijnVisser:fix-otel-collector-readiness-race
Open

[FLINK-40069][tests] Stabilize OpenTelemetryMetricReporterProtocolTest against collector startup race#28635
MartijnVisser wants to merge 2 commits into
apache:masterfrom
MartijnVisser:fix-otel-collector-readiness-race

Conversation

@MartijnVisser

Copy link
Copy Markdown
Contributor

What is the purpose of the change

Fixes intermittent failures of OpenTelemetryMetricReporterProtocolTest.testGzipCompressionGrpc/Http (MismatchedInputException: No content to map due to end-of-input after exhausting the full 2-minute retry budget; Azure build 76609, leg test_cron_jdk21_connect). Two issues combined: (1) OtelTestContainer used the default HostPortWaitStrategy, which false-reports readiness for the shell-less collector image (its internal exec check cannot run), so the reporter's export, invoked ~160ms before the collector logged readiness, began against a not-yet-listening OTLP receiver and failed; (2) the test exported exactly once before polling, so that single failed export left the collector output file empty for the entire timeout. Gzip is incidental: those two tests simply ran first against the cold container.

Brief change log

  • OtelTestContainer: wait for the collector's "Everything is ready. Begin running and processing data." log line (emitted only after all components including the OTLP receivers have started, unlike the health_check extension which can report healthy independently of receiver readiness) with a bounded 1-minute startup timeout.
  • OpenTelemetryTestBase: add an eventuallyConsumeJson overload with a pre-attempt hook; the existing single-arg method delegates with a no-op.
  • OpenTelemetryMetricReporterProtocolTest: re-export (report() + waitForLastReportToComplete()) inside the retry loop rather than once beforehand; this also covers transient export failures such as an observed one-off HTTP 404.

Verifying this change

This change is already covered by existing tests. Verified locally with Docker: the three protocol classes (metrics/events/traces) and both ITCases pass twice in a row (29 tests, 0 failures), with the metrics protocol class completing in seconds versus the 256s two-error CI failure.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

Was generative AI tooling used to co-author this PR?
  • Yes (Claude Opus 4.8, via Claude Code)

Generated-by: Claude Opus 4.8 (1M context)

…tainer

The default HostPortWaitStrategy reports the shell-less collector image as ready before its OTLP receiver accepts connections, so the test's first export could fail against a not-yet-listening receiver. Wait for the collector's readiness log line, which is only emitted after all components including the receivers have started, with a bounded startup timeout.

Generated-by: Claude Opus 4.8 (1M context)
The metric protocol test exported exactly once before polling the collector output file, so any single failed export (racing collector startup, or a transient HTTP 404) left the file empty and doomed the whole retry budget. Re-invoke report() inside each eventually() iteration via a pre-attempt hook; the assertion reads only the last line, so repeated exports are safe.

Generated-by: Claude Opus 4.8 (1M context)
@flinkbot

flinkbot commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@spuru9 spuru9 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor nit.

}

@Override
protected void setupAndReport(MetricConfig config) {

@spuru9 spuru9 Jul 4, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protected void setup(MetricConfig config) {

We are no longer reporting in this function so should be renamed? Will need changes in the caller function as well.

@github-actions github-actions Bot added the community-reviewed PR has been reviewed by the community. label Jul 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants