fix(firestore): enforce backpressure in BulkWriter (#12938)

bhshkh · web-flow · commit 820d0a244777 · 2026-04-30T16:55:57.000-04:00
This PR addresses a critical issue where the Firestore `BulkWriter` could **silently drop document writes** without notifying the caller, particularly under high load or when the process context was canceled. ## 🐛 Issue and Root Causes Investigation revealed that the current implementation of `BulkWriter` bypassed the client's built-in safety and resource management mechanisms: 1. **Disabled Backpressure:** All document writes were enqueued with a size of `0`, effectively disabling the `BufferedByteLimit` (https://pkg.go.dev/google.golang.org/api/support/bundler#Bundler) enforcement. This allowed the internal buffer to grow without bound, leading to memory pressure and potential Out-of-Memory (OOM) crashes. 2. **Ignored Queuing Errors:** The internal `write` function ignored return values from `bundler.Add`, meaning queuing failures were never reported to the user. ## ✅ Proposed Fix The fix moves `BulkWriter` to a managed resource model that respects backpressure and ensures loud failures: * **Runtime Size Calculation:** Computes the actual serialized size of each write using `proto.Size(w)`. * **Enforced Backpressure:** Replaces `Add(j, 0)` with `AddWait(ctx, j, estimatedSize)`. This ensures that the producer (application code) blocks if the internal 1GB buffer limit is reached, preventing unbounded memory growth. ## 📌 Benefits * **Data Integrity:** Guarantees that documents are either successfully queued or returned with an explicit error. * **System Stability:** Prevents OOM crashes by capping memory usage and slowing down producers that outpace the network. * **Alignment:** Brings the Go SDK into parity with the backpressure behavior found in other Firestore SDKs like Java and Node.js. #### Java Implementation The Java SDK uses an asynchronous "task" model to manage writes. * **Concurrency:** It leverages async threads (BulkCommitBatch) to handle parallel requests. * **Backpressure:** It implements a buffer limit on the number of pending operations to prevent memory exhaustion. When this limit is reached, subsequent attempts to queue writes will block the producer until space is available. #### Node.js Implementation Node.js follows a similar pattern but is optimized for its event-driven architecture. * **Buffering:** It automatically buffers writes into batches and ensures they are sent in order. * **Memory Management:** Similar to Java, it uses an internal buffer limit to impose backpressure on the event loop, preventing an unbounded queue of pending promises. #### Python Implementation The Python SDK is designed to be user-friendly by hiding the complexities of asynchronous execution. * **Parallelization:** It uses a `ThreadPoolExecutor` to send batches in parallel. This allows users to gain performance benefits without manually managing an event loop or using `asyncio`. * **Rate Limiting:** It includes a dedicated `RateLimiter` class to manage the ramp-up of write traffic. ## Impact Analysis The "breaking" change here is that Create might now block. However: * If the user's load is within normal limits, they won't notice a difference (the 1GB buffer is large). * If the user's load is excessive, they are already experiencing silent failures or OOMs. Blocking is the correct "fail-safe" state for their application's stability. * The BulkWriter methods already return an error. Returning a "context deadline exceeded" error from a blocking Create call is a valid and much more helpful response than returning nil and dropping the write. Fixes #11422
diff --git a/firestore/bulkwriter.go b/firestore/bulkwriter.go
@@ -27,6 +27,7 @@ import (
 	"google.golang.org/api/support/bundler"
 	"google.golang.org/grpc/codes"
 	"google.golang.org/grpc/status"
+	"google.golang.org/protobuf/proto"
 )
 
 const (
@@ -297,7 +298,6 @@ func (bw *BulkWriter) checkWriteConditions(doc *DocumentRef) error {
 
 // write packages up write requests into bulkWriterJob objects.
 func (bw *BulkWriter) write(w *pb.Write) (*BulkWriterJob, error) {
-
 	j := &BulkWriterJob{
 		resultChan: make(chan bulkWriterResult, 1),
 		write:      w,
@@ -307,8 +307,9 @@ func (bw *BulkWriter) write(w *pb.Write) (*BulkWriterJob, error) {
 	if err := bw.limiter.Wait(bw.ctx); err != nil {
 		return nil, err
 	}
-	err := bw.bundler.Add(j, 0)
-	if err != nil {
+
+	estimatedSize := proto.Size(w)
+	if err := bw.bundler.AddWait(bw.ctx, j, estimatedSize); err != nil {
 		return nil, err
 	}
 
@@ -360,7 +361,16 @@ func (bw *BulkWriter) send(i interface{}) {
 				// Do we need separate retry bundler?
 				_, isRetryable := batchWriteRetryCodes[codes.Code(s.Code)]
 				if j.attempts < maxRetryAttempts && isRetryable {
-					err := bw.bundler.Add(j, 0)
+					// Re-queue the job for retry. We use a size of 0 here for two reasons:
+					// 1. Consistency: Since the BulkWriter uses AddWait for backpressure,
+					//    we must continue using AddWait to avoid a "mixed methods" error from
+					//    the bundler.
+					// 2. Deadlock Prevention: The send() function runs within the bundler's
+					//    handler. The memory for this job was already accounted for during the
+					//    initial write() and will not be released until this handler returns.
+					//    Attempting to acquire additional weight here could cause a deadlock
+					//    if the buffer is full.
+					err := bw.bundler.AddWait(bw.ctx, j, 0)
 					if err != nil {
 						j.setError(fmt.Errorf("firestore: bulk write retry failed %w original error %v", err, status.Error(codes.Code(s.Code), s.Message)))
 					}