HDDS-15059. Shift streaming write sortDatanodes logic to OM by chihsuan · Pull Request #10633 · apache/ozone

chihsuan · 2026-06-29T17:05:46Z

What changes were proposed in this pull request?

SCM sorts the write pipeline (nearest datanode first) on every allocateBlock, on its block-allocation hot path. OM already caches the cluster topology (HDDS-9343) and sorts reads locally, so this PR moves the write sort to OM.

OMKeyRequest.allocateBlock sends an empty clientMachine to SCM (SCM skips sorting) and sorts each pipeline locally via a new KeyManager.sortDatanodesForWrite; the result is cached per pipeline.
Pipeline nodes returned over RPC aren't linked to OM's topology, so they would all look equidistant and sort randomly. OM resolves each node (and the client) to its canonical node in the cached cluster map before sorting, then maps the order back to the original nodes, matching what SCM's nodeManager.getNode did.
The cluster map is read once per sort and used for both client resolution and sorting, so a topology refresh mid-sort can't leave them on different snapshots and shuffle the order.
The client is matched by both IP and hostname (UserInfo.remoteAddress is always an IP), so a client co-located on a datanode is recognized even with hdds.datanode.use.datanode.hostname enabled.
When the client is empty or unresolved, the pipeline order is kept (no shuffle), since the first node is the write primary.
SCMBlockProtocolServer is unchanged for rolling-upgrade safety: an old OM still gets SCM-side sorting, a new OM's empty address is a no-op for SCM. No Protobuf/RPC change.

Note: SCM's ALLOCATE_BLOCK audit now logs client="" for OM-originated writes; the per-client audit stays at OM.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15059

How was this patch tested?

Unit (TestOMAllocateBlockRequest): SCM receives an empty clientMachine; the sorted order is applied to every block; a shared pipeline is sorted once.
Integration (TestOMSortDatanodes): nearest datanode is first for writes, including for RPC-deserialized (protobuf round-tripped) pipeline nodes; order is preserved for an empty or unresolved client; the client is matched by both IP and hostname.
Fork build-branch CI: https://github.com/chihsuan/ozone/actions/runs/28455187435

Generated-by: Claude Code (Claude Opus 4.8)

…lock

…comment

…eBlock

… client

peterxcli

@ivandika3 Is there any risk that a block ends up allocated on a suboptimal datanodes/pipeline because the OM cache topology is somehow stale?⁠

ivandika3 · 2026-07-02T01:36:19Z

@peterxcli thanks for checking.

Is there any risk that a block ends up allocated on a suboptimal datanodes/pipeline because the OM cache topology is somehow stale?⁠

Should have some effect on performance but not correctness, since Streaming Write Pipeline should be able to pick an arbitrary topology. The the data path (streaming WriteChunk data) can be sent to any primary (first node) is separated from the metadata path (PutBlock) which will be sent to the DN leader. The impact of suboptimal Streaming write pipeline topology should be worse write latency (e.g. if the topology picks the furthest node as the primary node). But this possible performance penalty also apply to read path (i.e. where the further node is read first) so I think it should be acceptable. Please let me know if I miss something.

@chihsuan Thanks for the patch, I'll review this soon.

szetszwo · 2026-07-02T18:54:50Z

... We can do the sorting on the OM instead of SCM to reduce SCM RPC processing time.

@ivandika3 , @chihsuan , Would this change improve the overall performance?

OM is supposed to be more CPU demanding than SCM. So, we probably want to save the CPU cycles in OM. What do you think?

ivandika3 · 2026-07-03T06:05:13Z

@szetszwo Thanks for checking this out.

The reason I raised for this is to follow the HDDS-9343 which should make the sort datanodes logic done in OM.

Would this change improve the overall performance?

I have not measured the overall performance difference yet, but for a single OM and single OM service, the performance improvement (if any) might not be that much.

OM is supposed to be more CPU demanding than SCM. So, we probably want to save the CPU cycles in OM. What do you think?

That is a good point. However, in our cluster we have multiple OM services that points to a single SCM service, so I think SCM service resources should be protected more since it can be the bottleneck as more OM services point to the same SCM service. Additionally, ideally SCM should spend most of its times on background services (processing heartbeat reports, etc) and therefore should spend as little time as possible in the user foreground processing (i.e. allocating blocks, fetching read pipelines, etc) which includes sort datanodes.

Please let me know what you think. This is not really a critical issue so I think we can defer it if you don't see any reason to support it now.

szetszwo · 2026-07-03T15:44:45Z

... in our cluster we have multiple OM services ...

Good to know that you are running with multiple OM services!

... This is not really a critical issue so I think we can defer it if you don't see any reason to support it now.

How about making it configurable?

ivandika3 · 2026-07-04T04:20:31Z

How about making it configurable?

Yes, making it configurable is a good idea.

chihsuan · 2026-07-04T07:03:24Z

Thanks @szetszwo and @ivandika3 for the discussion.

I'll add an OM-side boolean (e.g. ozone.om.block.write.sort.datanodes.enabled, name open to bikeshed), defaulting to false (SCM sorts, legacy behavior) so the default doesn't add OM CPU. Set true to sort on OM instead, for multiple OMs behind one SCM.

Does false sound like the right default, or would you prefer true?

ivandika3 · 2026-07-04T07:26:23Z

Does false sound like the right default, or would you prefer true?

@chihsuan Thanks for the follow up, let's set the default to false to preserve the current behavior. We can remove this configuration in the future if there is a significant performance improvement.

chihsuan added 7 commits June 29, 2026 21:54

HDDS-15059. Add OM-side sortDatanodesForWrite with latency metric

a767f7b

HDDS-15059. Sort write pipeline on OM instead of SCM during allocateB…

35ad85c

…lock

HDDS-15059. Address review: metric getter visibility and test helper …

c3ea84c

…comment

HDDS-15059. Guard empty client and cache per-pipeline sort in allocat…

fa5a466

…eBlock

HDDS-15059. Strengthen allocateBlock sort tests for reorder and empty…

5b19b72

… client

HDDS-15059. Match write-sort client by both hostname and IP

e29a899

HDDS-15059. Resolve write-sort pipeline nodes to OM cluster map

91bed08

chihsuan force-pushed the HDDS-15059 branch from 928a105 to 91bed08 Compare June 30, 2026 14:38

chihsuan added 3 commits June 30, 2026 23:13

HDDS-15059. Consolidate write-sort datanode tests

59f9c26

HDDS-15059. Read cluster map once per sort to avoid topology swap race

13f1e1a

HDDS-15059. Simplify write-sort client handling

5040d00

chihsuan marked this pull request as ready for review June 30, 2026 17:08

peterxcli reviewed Jul 1, 2026

View reviewed changes

ivandika3 requested a review from szetszwo July 2, 2026 01:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-15059. Shift streaming write sortDatanodes logic to OM#10633

HDDS-15059. Shift streaming write sortDatanodes logic to OM#10633
chihsuan wants to merge 10 commits into
apache:masterfrom
chihsuan:HDDS-15059

chihsuan commented Jun 29, 2026 •

edited

Loading

Uh oh!

peterxcli left a comment •

edited

Loading

Uh oh!

ivandika3 commented Jul 2, 2026 •

edited

Loading

Uh oh!

szetszwo commented Jul 2, 2026

Uh oh!

ivandika3 commented Jul 3, 2026 •

edited

Loading

Uh oh!

szetszwo commented Jul 3, 2026

Uh oh!

ivandika3 commented Jul 4, 2026

Uh oh!

chihsuan commented Jul 4, 2026

Uh oh!

ivandika3 commented Jul 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

chihsuan commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

peterxcli left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ivandika3 commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

szetszwo commented Jul 2, 2026

Uh oh!

ivandika3 commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

szetszwo commented Jul 3, 2026

Uh oh!

ivandika3 commented Jul 4, 2026

Uh oh!

chihsuan commented Jul 4, 2026

Uh oh!

ivandika3 commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

chihsuan commented Jun 29, 2026 •

edited

Loading

peterxcli left a comment •

edited

Loading

ivandika3 commented Jul 2, 2026 •

edited

Loading

ivandika3 commented Jul 3, 2026 •

edited

Loading

ivandika3 commented Jul 4, 2026 •

edited

Loading