
pubsub: Many acks/modacks could cause other acks/modacks to be delayed #9727

@jameshartig

Description

Client

PubSub

Environment

CentOS on GCE (t2d-standard-4)

Go Environment

NumGoroutines=40
MaxExtension=15s
MaxOutstandingMessages=40000
MaxOutstandingBytes=-1
Synchronous=false
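
For reference, the settings above correspond to a subscriber configuration along these lines (a sketch only; the field names are from the Go client's `pubsub.ReceiveSettings`, and the values are taken from this report):

```go
// Sketch of the subscriber configuration described above
// (values from this report, not a recommendation).
sub.ReceiveSettings = pubsub.ReceiveSettings{
	NumGoroutines:          40,
	MaxExtension:           15 * time.Second,
	MaxOutstandingMessages: 40000,
	MaxOutstandingBytes:    -1, // disables the outstanding-bytes limit
	Synchronous:            false,
}
```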

Expected behavior

I expect to see 0 in subscription/expired_ack_deadlines_count, assuming that my AckCount matches my PullCount.

Actual behavior

We periodically see a huge rate of expired acks, as high as 15k/s. We are currently acknowledging 20k messages per second across 2 GCE instances (~10k/s per instance) and pulling 20k messages per second across those instances as well, so I wouldn't expect to see any meaningful number of expired acks.

I don't know the actual distribution of messages across the 40 goroutines, but if some of them are receiving most of the messages, then it's possible for ackIDBatchSize to be exceeded. When it's exceeded, sendModAck and sendAck both loop internally until all of the ackIDs have been sent. We don't have visibility into the distribution of time it takes to Acknowledge 2500 ackIDs, but we can see from the GCP console that the Acknowledge method has a 95th-percentile latency of over 100ms. Separately, we are calling ModifyAckDeadline (which takes 40ms at the 95th percentile) with 16k IDs per second, which requires 7 calls per instance and could therefore take more than 250ms.
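
The back-of-the-envelope math from that paragraph can be checked directly (a sketch; 2500 approximates the client's ackIDBatchSize, and 40ms is the observed p95 per ModifyAckDeadline call):

```go
package main

import "fmt"

// worstCaseFlush estimates sequential RPC time when one sender goroutine
// must split pending ackIDs into fixed-size batches and issue the calls
// back to back.
func worstCaseFlush(pendingIDs, batchSize, perCallMillis int) (calls, totalMillis int) {
	calls = (pendingIDs + batchSize - 1) / batchSize // ceiling division
	return calls, calls * perCallMillis
}

func main() {
	// ~16k ModifyAckDeadline IDs per second at a 40ms p95 per call.
	calls, ms := worstCaseFlush(16000, 2500, 40)
	fmt.Printf("%d calls, %dms sequential\n", calls, ms) // 7 calls, 280ms sequential
}
```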

Either of those would end up delaying the other, since there's only a single sender goroutine, which could be contributing to our expired-acks issue.
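
A hypothetical illustration of that head-of-line blocking (not the client's actual code): with a single sender goroutine issuing batches strictly one after another, an ack queued behind a run of modack batches waits for all of them to finish before it even starts.

```go
package main

import "fmt"

// batch is a simulated RPC the single sender goroutine must issue.
type batch struct {
	kind string // "ack" or "modack"
	ms   int    // simulated per-call latency in milliseconds
}

// startTimes returns when each queued batch begins, given that a single
// sender goroutine issues them sequentially.
func startTimes(queue []batch) []int {
	starts := make([]int, len(queue))
	elapsed := 0
	for i, b := range queue {
		starts[i] = elapsed
		elapsed += b.ms
	}
	return starts
}

func main() {
	// Seven 40ms modack batches queued ahead of one ack batch:
	// the ack cannot start until 7 * 40ms = 280ms have passed.
	queue := []batch{
		{"modack", 40}, {"modack", 40}, {"modack", 40}, {"modack", 40},
		{"modack", 40}, {"modack", 40}, {"modack", 40},
		{"ack", 100},
	}
	for i, t := range startTimes(queue) {
		fmt.Printf("%s starts at t=%dms\n", queue[i].kind, t)
	}
}
```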

Additionally, since we aren't using exactly-once delivery, there's no way for us to measure how long it takes from when we call Ack to when the request is actually sent to Pub/Sub. One way to fix that would be for the *AckResult returned from AckWithResult to become Ready once the message is sent, even when exactly-once delivery is disabled.
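
This is roughly how such a measurement could look if AckWithResult behaved that way (a sketch of the proposed behavior, not working code today: without exactly-once delivery the returned *AckResult is currently Ready immediately, so this would measure nothing useful):

```go
// Hypothetical: if the *AckResult became Ready only once the
// acknowledge request was actually sent, we could measure the
// Ack-call-to-send latency like this.
err := sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
	start := time.Now()
	res := msg.AckWithResult()
	go func() {
		// Get blocks until the result is Ready.
		if _, err := res.Get(context.Background()); err == nil {
			log.Printf("ack sent after %v", time.Since(start))
		}
	}()
})
```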

Screenshots

(screenshots omitted)

Something else that's interesting is that the latencies shown in the GCP Console do not match our application-level metrics (which measure from start of Receive callback to Ack/Nack function call) at all:
(GCP Console latency screenshot)
vs
(application-level latency screenshot)

This is what led us to investigate if there was some sort of delay between when we Ack a message and when the underlying Acknowledge is sent by the client.

ModAckCount
(ModAckCount screenshot)
(the reason for the increase at 19:30 UTC is because we increased MaxOutstandingMessages from 25000 to 40000)

Finally, the increase in expired acks happened after a sharp decrease in the StreamingPull response rate, for which I have no explanation unless some change was made on Pub/Sub's side. It's not clear whether this might mean there's a higher concentration of messages in individual goroutines.
(StreamingPull response rate screenshot)

Additional context

We don't have any visibility into the 99th-percentile modack extension being used, which would have been helpful in debugging.

Labels

api: pubsub (Issues related to the Pub/Sub API)
status: investigating (The issue is under investigation, which is determined to be non-trivial.)
