Count the connection failure as the condition of quarantine #4727

Open
zymap wants to merge 2 commits into apache:master from zymap:count-connection-failure-as-quarantine-condition

@zymap zymap commented Mar 12, 2026


Motivation

Currently, the bookie client quarantine mechanism is triggered primarily by read and write error responses from Bookies. However, in multi-region deployments, a common failure mode is a network partition or DNS resolution failure at the region level.

In such scenarios:

  1. A Bookie remains registered in ZooKeeper (it can still heartbeat to its local ZK observer).
  2. The Client (Broker) cannot resolve the Bookie's IP or establish a TCP connection.
  3. The EnsemblePlacementPolicy (especially RegionAwareEnsemblePlacementPolicy) sees the Bookie as "Available" and repeatedly selects it to satisfy minRack or E/Qw constraints.
  4. The LedgerHandle fails to write because it cannot initialize a connection handle, triggering an Ensemble Change.
  5. Because the connection failure didn't trigger a quarantine, the placement policy picks the same problematic Bookie again in the next iteration.

This creates an infinite Ensemble Change loop, causing the Ledger write to hang indefinitely and bloating the Ledger metadata in ZooKeeper with thousands of segments.
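
The failure loop described above can be illustrated with a minimal sketch. This is not the code from this PR: the class `ConnectionFailureQuarantineSketch`, its methods, and the single fixed threshold are hypothetical stand-ins, and the error-counting interval and quarantine expiry are deliberately left out. It only shows the idea that connection failures feed the same per-bookie error counter that read and write errors already use, gated by a flag.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the BookKeeper implementation.
public class ConnectionFailureQuarantineSketch {
    // Assumed flag: whether connection failures count toward quarantine.
    private final boolean countConnectionFailures;
    // Assumed threshold: errors needed before a bookie is quarantined.
    private final int errorThreshold;
    // Per-bookie error counter (the interval reset is omitted for brevity).
    private final Map<String, Integer> errorCounts = new HashMap<>();

    public ConnectionFailureQuarantineSketch(boolean countConnectionFailures, int errorThreshold) {
        this.countConnectionFailures = countConnectionFailures;
        this.errorThreshold = errorThreshold;
    }

    // Read and write error responses always count.
    public void recordReadWriteError(String bookie) {
        errorCounts.merge(bookie, 1, Integer::sum);
    }

    // Connection failures (DNS resolution, TCP connect) count only when enabled.
    public void recordConnectionFailure(String bookie) {
        if (countConnectionFailures) {
            errorCounts.merge(bookie, 1, Integer::sum);
        }
    }

    // A quarantined bookie should be skipped when picking a replacement.
    public boolean isQuarantined(String bookie) {
        return errorCounts.getOrDefault(bookie, 0) >= errorThreshold;
    }

    public static void main(String[] args) {
        ConnectionFailureQuarantineSketch disabled = new ConnectionFailureQuarantineSketch(false, 3);
        ConnectionFailureQuarantineSketch enabled = new ConnectionFailureQuarantineSketch(true, 3);
        for (int i = 0; i < 3; i++) {
            disabled.recordConnectionFailure("bookie-in-partitioned-region");
            enabled.recordConnectionFailure("bookie-in-partitioned-region");
        }
        System.out.println("flag off, quarantined: "
                + disabled.isQuarantined("bookie-in-partitioned-region"));
        System.out.println("flag on, quarantined: "
                + enabled.isQuarantined("bookie-in-partitioned-region"));
    }
}
```

With the flag off, the unreachable bookie never reaches the threshold, so it stays eligible for selection. That corresponds to step 5 above.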

zymap added 2 commits March 12, 2026 14:27

@lhotari lhotari left a comment

Suggesting a new name for the config. It seems that getBoolean would throw an exception if the setting is missing and no default value is provided.

protected static final String BOOKIE_ERROR_THRESHOLD_PER_INTERVAL = "bookieErrorThresholdPerInterval";
protected static final String BOOKIE_QUARANTINE_TIME_SECONDS = "bookieQuarantineTimeSeconds";
protected static final String BOOKIE_QUARANTINE_RATIO = "bookieQuarantineRatio";
protected static final String BOOKIE_CONNECTING_ERROR_COUNTED_INTO_QUARANTINE = "bookieConnectingErrorCountedIntoQuarantine";

Could we call this bookieConnectionErrorQuarantineEnabled instead? The "counted into" part is slightly confusing.

Suggested change
- protected static final String BOOKIE_CONNECTING_ERROR_COUNTED_INTO_QUARANTINE = "bookieConnectingErrorCountedIntoQuarantine";
+ protected static final String BOOKIE_CONNECTING_ERROR_QUARANTINE_ENABLED = "bookieConnectionErrorQuarantineEnabled";

* @return
*/
public boolean getBookieConnectingErrorCountedIntoQuarantine() {
return getBoolean(BOOKIE_CONNECTING_ERROR_COUNTED_INTO_QUARANTINE);

A default value should be provided as the second argument, for example getBoolean(BOOKIE_CONNECTING_ERROR_QUARANTINE_ENABLED, false) (using the suggested config name).
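
To make the point about the default value concrete, here is a small self-contained sketch. The class `QuarantineConfigSketch` and its map-backed `getBoolean` overloads are hypothetical stand-ins written for illustration; they only imitate the behavior described in this review (the one-argument form throws when the key is missing, the two-argument form falls back to the default). The setter is also an assumption, not part of the diff shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a configuration class; not the real ClientConfiguration.
public class QuarantineConfigSketch {
    // Constant name and key as suggested in the review above.
    protected static final String BOOKIE_CONNECTING_ERROR_QUARANTINE_ENABLED =
            "bookieConnectionErrorQuarantineEnabled";

    private final Map<String, Boolean> props = new HashMap<>();

    // Imitates a lookup without a default: a missing key is an error.
    boolean getBoolean(String key) {
        Boolean value = props.get(key);
        if (value == null) {
            throw new NoSuchElementException("Missing setting: " + key);
        }
        return value;
    }

    // Imitates a lookup with a default: a missing key yields the default.
    boolean getBoolean(String key, boolean defaultValue) {
        Boolean value = props.get(key);
        return value == null ? defaultValue : value;
    }

    // Getter that is safe when the setting is absent (disabled by default).
    public boolean getBookieConnectionErrorQuarantineEnabled() {
        return getBoolean(BOOKIE_CONNECTING_ERROR_QUARANTINE_ENABLED, false);
    }

    // Assumed fluent setter, added only so the sketch can be exercised.
    public QuarantineConfigSketch setBookieConnectionErrorQuarantineEnabled(boolean enabled) {
        props.put(BOOKIE_CONNECTING_ERROR_QUARANTINE_ENABLED, enabled);
        return this;
    }

    public static void main(String[] args) {
        QuarantineConfigSketch conf = new QuarantineConfigSketch();
        System.out.println("unset: " + conf.getBookieConnectionErrorQuarantineEnabled());
        conf.setBookieConnectionErrorQuarantineEnabled(true);
        System.out.println("set: " + conf.getBookieConnectionErrorQuarantineEnabled());
    }
}
```

Defaulting to false keeps existing deployments on the current behavior unless they opt in.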
