Client
PubSub
Environment
CentOS on GCE (t2d-standard-4)
Go Environment
NumGoroutines=40
MaxExtension=15s
MaxOutstandingMessages=40000
MaxOutstandingBytes=-1
Synchronous=false
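For concreteness, these settings correspond to the following subscriber configuration in the Go client (a minimal sketch; the project and subscription IDs are placeholders):

```go
package main

import (
	"context"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		panic(err)
	}
	sub := client.Subscription("my-subscription") // placeholder subscription ID
	sub.ReceiveSettings = pubsub.ReceiveSettings{
		NumGoroutines:          40,
		MaxExtension:           15 * time.Second,
		MaxOutstandingMessages: 40000,
		MaxOutstandingBytes:    -1, // negative value disables the byte-based flow control limit
		Synchronous:            false,
	}
	_ = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		msg.Ack()
	})
}
```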
Expected behavior
I expect to see 0 in subscription/expired_ack_deadlines_count, assuming that my AckCount matches my PullCount.
Actual behavior
We periodically see a huge rate of expired acks, as high as 15k/s. We are currently acknowledging ~20k messages per second across 2 GCE instances (~10k/s per instance) and pulling ~20k messages per second across those instances as well, so I wouldn't expect to see any expired acks at all.
I don't know the actual distribution of messages across the 40 goroutines, but if some of them are receiving most of the messages, it's possible for ackIDBatchSize to be exceeded. When it's exceeded, sendModAck and sendAck both loop internally until all of the ackIDs have been sent. We don't have visibility into the distribution of time it takes to Acknowledge 2500 ackIDs, but we can see from the GCP console that the Acknowledge method has a 95th percentile latency of over 100ms. Separately, we are calling ModifyAckDeadline (40ms at the 95th percentile) with ~16k IDs per second; at 2500 IDs per call that's 7 calls per instance, which could take more than 250ms in total.
Either of those would end up delaying the other, since there's only a single sender goroutine, which could be contributing to our expired acks issue.
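To make the serialization concern concrete, here is a minimal sketch of the pattern we believe we're hitting (simplified, not the library's actual code; the latencies in the comment are the p95 figures from the console):

```go
// A single goroutine drains acks and modacks in ackIDBatchSize chunks,
// awaiting each RPC before sending the next batch. At ~16k modacks/s
// (7 batches at ~40ms each, ~280ms) plus ack batches at 100ms+ each,
// one slow RPC delays everything queued behind it.
const ackIDBatchSize = 2500

func sendInBatches(ackIDs []string, rpc func(batch []string) error) error {
	for len(ackIDs) > 0 {
		n := ackIDBatchSize
		if len(ackIDs) < n {
			n = len(ackIDs)
		}
		if err := rpc(ackIDs[:n]); err != nil { // blocks until this batch's RPC returns
			return err
		}
		ackIDs = ackIDs[n:]
	}
	return nil
}
```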
Additionally, since we aren't using exactly-once delivery, there's no way for us to measure how long it takes from when we call Ack to when the request is actually sent to Pub/Sub. One way to fix that would be for the *AckResult returned from AckWithResult to only become Ready once the message has been sent, even when exactly-once delivery is not in use.
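For example, if the result became Ready only once the request was sent, we could measure the client-side ack delay like this (sketch):

```go
_ = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
	start := time.Now()
	res := msg.AckWithResult()
	go func() {
		// Without exactly-once delivery, Get currently returns success
		// immediately, so this measures ~0. With the proposed change it
		// would capture the time from calling Ack to the Acknowledge
		// request actually being sent.
		if _, err := res.Get(context.Background()); err == nil {
			log.Printf("ack send delay: %v", time.Since(start))
		}
	}()
})
```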
Screenshots


Something else that's interesting is that the latencies shown in the GCP Console do not match our application-level metrics (which measure from start of Receive callback to Ack/Nack function call) at all:

vs

This is what led us to investigate whether there was some sort of delay between when we Ack a message and when the underlying Acknowledge request is sent by the client.
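For reference, our application-level metric is measured roughly like this (sketch; handleMessage is a stand-in for our processing code):

```go
_ = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
	start := time.Now() // start of the Receive callback
	handleMessage(msg)  // application processing (stand-in)
	msg.Ack()
	// This covers callback start -> Ack call only; it cannot see how long
	// the client then takes to actually send the Acknowledge request,
	// which the GCP Console latency does include.
	log.Printf("app-level ack latency: %v", time.Since(start))
})
```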
ModAckCount

(The increase at 19:30 UTC is because we raised MaxOutstandingMessages from 25000 to 40000.)
Finally, the increase in expired acks happened after a sharp decrease in StreamingPull responses, for which I have no explanation unless some change was made on Pub/Sub's side. It's not clear whether this might mean there's a higher concentration of messages in individual goroutines.

Additional context
We don't have any visibility into the 99th percentile modack extension being used, and that would have been helpful in debugging.