
pubsub: Old messages fail to ack because "expired" #1485

@jameshartig

Description


This is related to #1247, but I'm filing a new issue since I have more data now and different questions. We've been running our "high throughput" queues with Synchronous=false, but it's not clear what the disadvantage would be of always setting it to true.

We don't usually see any failed acks (with either true or false), but this time in particular we were catching up on a subscription that was >9 million messages and >2 hours behind. These "expired" errors are making it almost impossible to catch up, since at some points over half of the messages we ack are failing and being retried.

I'm not sure whether there's an issue where we're receiving already-expired messages or whether the client is holding onto messages too long (since we have Synchronous=false). I raised the Ack Deadline in the Google Console to 60 seconds and didn't see any change (the change was made at 19:30, relative to the graphs below), so I'm inclined to think something is wrong here. I'm also not sure if we should just be using Synchronous=true instead.

Client

PubSub (aef6eeb)

Describe Your Environment

CentOS 7 on GCE (specifically in us-east1)
2 workers in each region with

MaxOutstandingMessages = 2000
Timeout = 15*time.Second
NumConsumers = 2000
NumGoroutines = 20 (10 x CPUs)
Synchronous = false

Expected Behavior

Acks succeed and we're not losing half of our acks to "expired" errors.

Actual Behavior

We're seeing thousands of messages failing ack with the error "expired":
[screenshot: graph of acks failing with "expired"]

In #1247 (comment) you mentioned that if Synchronous=false, the client fetches more than MaxOutstandingMessages. Over that same window, comparing the pubsub_pull_count OpenCensus metric to our internal count of acks shows that we're acking all of the pulled messages:
[screenshot: pubsub_pull_count vs. internal ack count]

Does pubsub_pull_count not include the "extra" messages that are pulled? If not, how can we determine that and graph it to aid with debugging?

This subscription in particular uses more CPU if a job is duplicated, so more duplicates cause the CPU to spike and jobs to take longer to ack. The 95th percentile of ack time is < 5 seconds, so I would imagine that even if the client fetched 2x MaxOutstandingMessages we could still ack all of them before the deadline (even with no ModAcks).
[screenshot: ack-latency percentiles]

Metadata

Labels

api: pubsub (Issues related to the Pub/Sub API)
type: question (Request for information or clarification. Not an issue.)
