This is related to #1247, but I'm filing a new issue since I now have more data and different questions. We've been running our "high throughput" queues with Synchronous=false, but it's not clear what the disadvantage of always setting it to true would be.
We don't usually see any failed acks (with true or false), but this time in particular we were catching up on a subscription that was >9 million messages and >2 hours behind. These "expired" errors are making it almost impossible to catch up, since at some points over half of the messages we ack are failing and being retried.
I'm not sure whether we're receiving already-expired messages or whether the client is holding onto messages too long (since we have Synchronous=false). However, I raised the Ack Deadline in the Google Console to 60 seconds and didn't see any change (the change was made at 19:30 on the graphs below), so I'm inclined to think something is wrong here. I'm also not sure whether we should just be using Synchronous=true instead.
Client
PubSub (aef6eeb)
Describe Your Environment
CentOS 7 on GCE (specifically in us-east1)
2 workers in each region with:
MaxOutstandingMessages = 2000
Timeout = 15*time.Second
NumConsumers = 2000
NumGoroutines = 20 (10 x CPUs)
Synchronous = false
Expected Behavior
Acks succeed and we're not losing half of our acks to "expired" errors.
Actual Behavior
We're seeing thousands of messages failing ack with the error "expired":

In #1247 (comment) you mentioned that if Synchronous=false, the client fetches more than MaxOutstandingMessages. During that same time, comparing the pubsub_pull_count OpenCensus metric to our internal count of acks shows that we're acking all of the pulled messages:

Does pubsub_pull_count not include the count of "extra" messages that are pulled? If not, how can we determine that count and graph it to aid debugging?
This subscription in particular uses more CPU when a job is duplicated, so more duplicates cause CPU to spike and jobs to take longer to ack. The 95th percentile of the time it takes to ack is < 5 seconds, so I would expect that even if the client fetched 2x MaxOutstandingMessages, we could still ack all of them before the deadline (even with no ModAcks).
