Skip to content

HDDS-15059. Shift streaming write sortDatanodes logic to OM#10633

Open
chihsuan wants to merge 10 commits into
apache:masterfrom
chihsuan:HDDS-15059
Open

HDDS-15059. Shift streaming write sortDatanodes logic to OM#10633
chihsuan wants to merge 10 commits into
apache:masterfrom
chihsuan:HDDS-15059

Conversation

@chihsuan

@chihsuan chihsuan commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

SCM sorts the write pipeline (nearest datanode first) on every allocateBlock, on its block-allocation hot path. OM already caches the cluster topology (HDDS-9343) and sorts reads locally, so this PR moves the write sort to OM.

  • OMKeyRequest.allocateBlock sends an empty clientMachine to SCM (SCM skips sorting) and sorts each pipeline locally via a new KeyManager.sortDatanodesForWrite; the result is cached per pipeline.
  • Pipeline nodes returned over RPC aren't linked to OM's topology, so they would all look equidistant and sort randomly. OM resolves each node (and the client) to its canonical node in the cached cluster map before sorting, then maps the order back to the original nodes, matching what SCM's nodeManager.getNode did.
  • The cluster map is read once per sort and used for both client resolution and sorting, so a topology refresh mid-sort can't leave them on different snapshots and shuffle the order.
  • The client is matched by both IP and hostname (UserInfo.remoteAddress is always an IP), so a client co-located on a datanode is recognized even with hdds.datanode.use.datanode.hostname enabled.
  • When the client is empty or unresolved, the pipeline order is kept (no shuffle), since the first node is the write primary.
  • SCMBlockProtocolServer is unchanged for rolling-upgrade safety: an old OM still gets SCM-side sorting, a new OM's empty address is a no-op for SCM. No Protobuf/RPC change.

Note: SCM's ALLOCATE_BLOCK audit now logs client="" for OM-originated writes; the per-client audit stays at OM.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15059

How was this patch tested?

  • Unit (TestOMAllocateBlockRequest): SCM receives an empty clientMachine; the sorted order is applied to every block; a shared pipeline is sorted once.
  • Integration (TestOMSortDatanodes): nearest datanode is first for writes, including for RPC-deserialized (protobuf round-tripped) pipeline nodes; order is preserved for an empty or unresolved client; the client is matched by both IP and hostname.
  • Fork build-branch CI: https://github.com/chihsuan/ozone/actions/runs/28455187435

Generated-by: Claude Code (Claude Opus 4.8)

@chihsuan chihsuan marked this pull request as ready for review June 30, 2026 17:08

@peterxcli peterxcli left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivandika3 Is there any risk that a block ends up allocated on a suboptimal datanodes/pipeline because the OM cache topology is somehow stale?⁠

@ivandika3

ivandika3 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

@peterxcli thanks for checking.

Is there any risk that a block ends up allocated on a suboptimal datanodes/pipeline because the OM cache topology is somehow stale?⁠

Should have some effect on performance but not correctness, since Streaming Write Pipeline should be able to pick an arbitrary topology. The the data path (streaming WriteChunk data) can be sent to any primary (first node) is separated from the metadata path (PutBlock) which will be sent to the DN leader. The impact of suboptimal Streaming write pipeline topology should be worse write latency (e.g. if the topology picks the furthest node as the primary node). But this possible performance penalty also apply to read path (i.e. where the further node is read first) so I think it should be acceptable. Please let me know if I miss something.

@chihsuan Thanks for the patch, I'll review this soon.

@ivandika3 ivandika3 requested a review from szetszwo July 2, 2026 01:36
@szetszwo

szetszwo commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

... We can do the sorting on the OM instead of SCM to reduce SCM RPC processing time.

@ivandika3 , @chihsuan , Would this change improve the overall performance?

OM is supposed to be more CPU demanding than SCM. So, we probably want to save the CPU cycles in OM. What do you think?

@ivandika3

ivandika3 commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@szetszwo Thanks for checking this out.

The reason I raised for this is to follow the HDDS-9343 which should make the sort datanodes logic done in OM.

Would this change improve the overall performance?

I have not measured the overall performance difference yet, but for a single OM and single OM service, the performance improvement (if any) might not be that much.

OM is supposed to be more CPU demanding than SCM. So, we probably want to save the CPU cycles in OM. What do you think?

That is a good point. However, in our cluster we have multiple OM services that points to a single SCM service, so I think SCM service resources should be protected more since it can be the bottleneck as more OM services point to the same SCM service. Additionally, ideally SCM should spend most of its times on background services (processing heartbeat reports, etc) and therefore should spend as little time as possible in the user foreground processing (i.e. allocating blocks, fetching read pipelines, etc) which includes sort datanodes.

Please let me know what you think. This is not really a critical issue so I think we can defer it if you don't see any reason to support it now.

@szetszwo

szetszwo commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

... in our cluster we have multiple OM services ...

Good to know that you are running with multiple OM services!

... This is not really a critical issue so I think we can defer it if you don't see any reason to support it now.

How about making it configurable?

@ivandika3

Copy link
Copy Markdown
Contributor

How about making it configurable?

Yes, making it configurable is a good idea.

@chihsuan

chihsuan commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @szetszwo and @ivandika3 for the discussion.

I'll add an OM-side boolean (e.g. ozone.om.block.write.sort.datanodes.enabled, name open to bikeshed), defaulting to false (SCM sorts, legacy behavior) so the default doesn't add OM CPU. Set true to sort on OM instead, for multiple OMs behind one SCM.

Does false sound like the right default, or would you prefer true?

@ivandika3

ivandika3 commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Does false sound like the right default, or would you prefer true?

@chihsuan Thanks for the follow up, let's set the default to false to preserve the current behavior. We can remove this configuration in the future if there is a significant performance improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants