OptimizeBenchmarkPreCreateDocFlow #48393

Draft
xinlian12 wants to merge 19 commits into Azure:main from xinlian12:optimizeBenchmarkPreCreateDocFlow

Conversation

@xinlian12
Member

Summary

Replaced individual createItem calls with executeBulkOperations for document pre-population across all benchmark workloads. This leverages the SDK's built-in bulk executor which handles throttling and retries internally, resulting in simpler and more efficient pre-population.

Changes

Files modified

  • AsyncBenchmark.java — Replaced createItem loop + Flux.merge() with CosmosBulkOperations.getCreateItemOperation() + executeBulkOperations()
  • AsyncCtlWorkload.java — Same bulk pattern for createPrePopulatedDocs(), with success/failure tracking via CosmosBulkOperationResponse
  • AsyncEncryptionBenchmark.java — Same bulk pattern using cosmosEncryptionAsyncContainer.executeBulkOperations()
  • ReadMyWriteWorkflow.java — Bulk pre-population + full migration from internal Document/AsyncDocumentClient APIs to public PojoizedJson/CosmosAsyncContainer v4 APIs

Key design points

  • Documents are generated upfront, then bulk-created via executeBulkOperations() (same pattern as the existing DataLoader in the linkedin subpackage)
  • The bulk executor handles throttling (429) and transient error retries internally
  • Removed manual retry/conflict handling from pre-population code
  • ReadMyWriteWorkflow no longer depends on internal SDK types (AsyncDocumentClient, Document, QueryFeedOperationState, etc.)
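The bulk executor micro-batches the generated operations internally before dispatching them. As a rough pure-JDK illustration of the batching idea only (the `partition` helper is hypothetical, not the SDK's implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
    // Hypothetical helper: split a list of pre-generated documents into
    // fixed-size batches, the way a bulk executor groups operations
    // before dispatching them as bulk requests.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        System.out.println(partition(docs, 3).size()); // 10 docs in batches of 3 -> 4 batches
    }
}
```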

Annie Liang and others added 2 commits March 11, 2026 19:10
Replaced individual createItem calls with executeBulkOperations for
document pre-population in AsyncBenchmark, AsyncCtlWorkload,
AsyncEncryptionBenchmark, and ReadMyWriteWorkflow.

Also migrated ReadMyWriteWorkflow from internal Document/AsyncDocumentClient
APIs to the public PojoizedJson/CosmosAsyncContainer v4 APIs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace pre-materialized List<CosmosItemOperation> with Flux.range().map()
to lazily emit operations on demand. This avoids holding all N operations
in memory simultaneously - the bulk executor consumes them as they are
generated, allowing GC to reclaim processed operation wrappers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
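The lazy-emission idea above (`Flux.range().map()`) can be mimicked with plain JDK streams; `createOperation` below is a hypothetical stand-in for building one CosmosItemOperation:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class LazyOpsSketch {
    static final AtomicInteger generated = new AtomicInteger();

    // Hypothetical stand-in for CosmosBulkOperations.getCreateItemOperation(...).
    static String createOperation(int i) {
        generated.incrementAndGet();
        return "op-" + i;
    }

    public static void main(String[] args) {
        // Analogous to Flux.range(0, n).map(...): operations are built only
        // as the consumer pulls them, so all N never sit in memory at once.
        IntStream.range(0, 1_000_000)
                 .mapToObj(LazyOpsSketch::createOperation)
                 .limit(5)               // consumer only pulls 5
                 .forEach(op -> {});
        System.out.println(generated.get()); // 5, not 1,000,000
    }
}
```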
@xinlian12 changed the title from "Optimize benchmark pre create doc flow" to "OptimizeBenchmarkPreCreateDocFlow" on Mar 12, 2026
Annie Liang and others added 3 commits March 11, 2026 20:15
If a bulk operation fails, fall back to individual createItem calls with
retry logic (max 5 retries for transient errors: 410, 408, 429, 500, 503)
and 409 conflict suppression. The retry helper is centralized in
BenchmarkHelper.retryFailedBulkOperations().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
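A minimal pure-JDK sketch of that fallback policy (`createWithRetry` is a hypothetical stand-in for BenchmarkHelper.retryFailedBulkOperations(), reduced to status-code handling):

```java
import java.util.Set;
import java.util.function.IntSupplier;

public class RetrySketch {
    static final Set<Integer> TRANSIENT = Set.of(410, 408, 429, 500, 503);
    static final int MAX_RETRIES = 5;

    // Retry transient status codes up to 5 times, treat 409 (document
    // already exists) as success, and surface anything else immediately.
    static boolean createWithRetry(IntSupplier attempt) {
        for (int tries = 0; tries <= MAX_RETRIES; tries++) {
            int status = attempt.getAsInt();
            if (status < 300 || status == 409) return true;  // created, or conflict suppressed
            if (!TRANSIENT.contains(status)) return false;   // non-retriable error
        }
        return false; // retries exhausted
    }

    public static void main(String[] args) {
        int[] responses = {429, 503, 201}; // two transient failures, then success
        int[] i = {0};
        System.out.println(createWithRetry(() -> responses[i[0]++])); // true
    }
}
```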
1. HttpHeaders.set()/getHeader(): Add toLowerCaseIfNeeded() fast-path
   that skips String.toLowerCase() allocation when header name is already
   all-lowercase (common for x-ms-* and standard Cosmos headers).

2. RxGatewayStoreModel.getUri(): Build URI via StringBuilder instead of
   the 7-arg URI constructor which re-validates and re-encodes all
   components. Since components are already well-formed, the single-arg
   URI(String) constructor is sufficient and avoids URI$Parser overhead.

3. RxDocumentServiceRequest: Cache getCollectionName() result to avoid
   repeated O(n) slash-scanning across 14+ call sites per request
   lifecycle. Cache is invalidated when resourceAddress changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
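Point 3 above can be sketched in plain Java; the address format and slash-scan below are assumptions for illustration, not the SDK's exact code:

```java
public class CollectionNameCache {
    private String resourceAddress;
    private String cachedCollectionName; // invalidated when the address changes

    // Scan once for the prefix up to the 4th path segment of addresses
    // shaped like "dbs/myDb/colls/myColl/docs/myDoc", then reuse the result
    // instead of re-scanning on every call.
    String getCollectionName() {
        if (cachedCollectionName == null && resourceAddress != null) {
            int slashes = 0, i = 0;
            for (; i < resourceAddress.length(); i++) {
                if (resourceAddress.charAt(i) == '/' && ++slashes == 4) break;
            }
            cachedCollectionName = resourceAddress.substring(0, i);
        }
        return cachedCollectionName;
    }

    void setResourceAddress(String address) {
        this.resourceAddress = address;
        this.cachedCollectionName = null; // cache invalidation on change
    }
}
```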
The char-by-char scan added method call + branch overhead that offset
the toLowerCase savings. Profiling showed ConcurrentHashMap.get(),
HashMap.putVal(), and the scan loop itself caused ~10% throughput
regression. Reverting to original toLowerCase(Locale.ROOT) which the
JIT handles as an intrinsic.

The URI construction and collection name caching optimizations are
retained as they don't have this issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@FabianMeiswinkel (Member) left a comment:
LGTM

Annie Liang and others added 14 commits March 12, 2026 11:52
The JFR profiling showed URI$Parser.parse() consuming ~757 CPU samples
per 60s recording, all from RxGatewayStoreModel.getUri(). The root cause
was a String->URI->String round-trip: we built a URI string, parsed it
into java.net.URI (expensive), then Reactor Netty called .toASCIIString()
to convert it back to a String.

Changes:
- RxGatewayStoreModel.getUri() now returns String directly (no URI parse)
- HttpRequest: add uriString field with lazy URI parsing via uri()
- HttpRequest: new String-based constructor to skip URI parse entirely
- ReactorNettyClient: use request.uriString() instead of uri().toASCIIString()
- RxGatewayStoreModel: use uriString() for diagnostics/error paths
- URI is only parsed lazily on error paths that require a URI object

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
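The String-first, lazily-parsed-URI shape described above might look like this (a hypothetical class, not the SDK's actual HttpRequest):

```java
import java.net.URI;

public class LazyUriRequest {
    private final String uriString;  // hot path: plain String, never parsed
    private URI uri;                 // cold path: parsed only on demand

    LazyUriRequest(String uriString) {
        this.uriString = uriString;
    }

    // The transport layer consumes the String directly, skipping URI$Parser.
    String uriString() {
        return uriString;
    }

    // Only error/diagnostic paths that need a real URI pay the parse cost.
    URI uri() {
        if (uri == null) {
            uri = URI.create(uriString);
        }
        return uri;
    }
}
```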
Add http2Enabled and http2MaxConcurrentStreams config options to
TenantWorkloadConfig. When http2Enabled=true, configures
Http2ConnectionConfig on GatewayConnectionConfig for AsyncBenchmark,
AsyncCtlWorkload, and AsyncEncryptionBenchmark.

Usage in workload JSON config:
  "http2Enabled": true,
  "http2MaxConcurrentStreams": 30

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add missing cases in applyField switch statement so these fields are
properly inherited from tenantDefaults, not only from individual tenant
entries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ensures every @JsonProperty field in TenantWorkloadConfig has a
corresponding case in the applyField() switch statement. This prevents
future fields from silently failing to inherit from tenantDefaults,
which was the root cause of the http2Enabled bug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
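The exhaustiveness guard could be sketched with reflection; `Config` and `HANDLED` below are hypothetical stand-ins for TenantWorkloadConfig and the set of cases applyField() handles:

```java
import java.lang.reflect.Field;
import java.util.Set;

public class ApplyFieldGuard {
    // Hypothetical config class standing in for TenantWorkloadConfig.
    static class Config {
        public Boolean http2Enabled;
        public Integer http2MaxConcurrentStreams;
    }

    // Field names that applyField() knows how to inherit from defaults.
    static final Set<String> HANDLED =
        Set.of("http2Enabled", "http2MaxConcurrentStreams");

    // Every declared field must have a matching case, so a newly added
    // field fails this check fast instead of silently not inheriting.
    static boolean allFieldsHandled() {
        for (Field f : Config.class.getDeclaredFields()) {
            if (!HANDLED.contains(f.getName())) return false;
        }
        return true;
    }
}
```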
Previously, every Gateway response copied ALL Netty response headers
through a 3-step chain:
  1. Netty headers → HttpHeaders (toLowerCase + new HttpHeader per entry)
  2. HttpHeaders.toLowerCaseMap() → new HashMap<String,String>
  3. StoreResponse constructor → String[] arrays

Now the flow is:
  1. Netty headers → Map<String,String> directly (single toLowerCase pass)
  2. StoreResponse constructor → String[] arrays

Changes:
- HttpResponse: add headerMap() returning Map<String,String> directly
- ReactorNettyHttpResponse: override headerMap() to build lowercase map
  from Netty headers without intermediate HttpHeaders object
- HttpTransportSerializer: unwrapToStoreResponse takes Map<String,String>
  instead of HttpHeaders
- RxGatewayStoreModel: use httpResponse.headerMap() instead of headers()
- ThinClientStoreModel: pass response.getHeaders().asMap() directly
  instead of wrapping in new HttpHeaders()

This eliminates per-response: ~20 HttpHeader object allocations,
~20 extra toLowerCase calls, and one intermediate HashMap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
StoreResponse now stores the response headers Map<String,String>
directly instead of converting to parallel String[] arrays. This
eliminates a redundant copy since RxDocumentServiceResponse and
StoreClient were immediately converting back to Map.

Before: Map → String[] + String[] → Map (3 allocations, 2 iterations)
After:  Map shared directly (0 extra allocations, 0 extra iterations)

Also upgrades StoreResponse.getHeaderValue() from O(n) linear scan
to O(1) HashMap.get() with case-insensitive fallback.

Null header values from Netty are skipped (matching old HttpHeaders.set
behavior which removed null entries).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
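The O(1) lookup with case-insensitive fallback might be sketched like this (hypothetical code, assuming headers are stored lowercase as the commit describes):

```java
import java.util.Locale;
import java.util.Map;

public class HeaderLookupSketch {
    // Common path: headers are stored lowercase, so one HashMap.get() hits.
    // Fallback: lowercase the caller's name once for mixed-case lookups,
    // replacing the old O(n) linear scan over parallel String[] arrays.
    static String getHeaderValue(Map<String, String> headers, String name) {
        String value = headers.get(name);
        if (value == null) {
            value = headers.get(name.toLowerCase(Locale.ROOT));
        }
        return value;
    }
}
```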
The new toArray(new String[0]) calls in getResponseHeaderNames() and
getResponseHeaderValues() created garbage arrays on every call. These
methods have zero production callers — only test validators used them.

Changes:
- Mark getResponseHeaderNames/Values as @deprecated
- Update StoreResponseValidator to use getResponseHeaders() map directly
  instead of converting to arrays and doing indexOf lookups

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert the headerMap() direct-from-Netty path because the per-header
toLowerCase() calls caused a throughput regression vs v4. The JIT
optimizes the existing HttpHeaders.set() + toLowerCaseMap() path better.

Kept improvements:
- StoreResponse stores Map<String,String> directly (no String[] arrays)
- RxDocumentServiceResponse shares the Map reference (no extra copy)
- StoreClient uses getResponseHeaders() directly (no Map reconstruction)
- StoreResponse.getHeaderValue() uses HashMap.get() instead of O(n) scan
- unwrapToStoreResponse calls toLowerCaseMap() once, reuses the Map for
  both validateOrThrow and StoreResponse construction

Net effect vs v4: eliminates the Map→String[]→Map round-trip while
preserving the JIT-optimized HttpHeaders copy path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Netty's HttpObjectDecoder starts with a 256-byte buffer for header
parsing and resizes via ensureCapacityInternal() as headers grow.
Cosmos responses have ~2-4KB of headers, triggering multiple resizes.

Pre-sizing to 16KB (16384 bytes) avoids the resize overhead at the
cost of ~16KB per connection (negligible vs connection pool size).

JFR v6 showed AbstractStringBuilder.ensureCapacityInternal at 248
samples (1.6% CPU).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert all header copy chain changes (R3/v5/v6/v7) back to the v4
state which had the best throughput. Only addition on top of v4 is
initialBufferSize(16384) to pre-size Netty's header parsing buffer
and reduce ensureCapacityInternal() resize overhead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Benchmark showed initialBufferSize change also produced regression.
Reverting to pure v4 state (URI elimination + collection name cache)
which had the best throughput at 2,421 ops/s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>