
pubsub: Many acks/modacks could cause other acks/modacks to be delayed #9727

@jameshartig

Description

Client

PubSub

Environment

CentOS on GCE (t2d-standard-4)

Go Environment

NumGoroutines=40
MaxExtension=15s
MaxOutstandingMessages=40000
MaxOutstandingBytes=-1
Synchronous=false
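
For reference, the settings above correspond to a subscriber configuration along these lines (a sketch only; the field names are from the Go client's `pubsub.ReceiveSettings`, and the values are taken from this report):

```go
// Sketch of the subscriber configuration described above
// (values from this report, not a recommendation).
sub.ReceiveSettings = pubsub.ReceiveSettings{
	NumGoroutines:          40,
	MaxExtension:           15 * time.Second,
	MaxOutstandingMessages: 40000,
	MaxOutstandingBytes:    -1, // disables the outstanding-bytes limit
	Synchronous:            false,
}
```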

Expected behavior

I expect to see 0 in subscription/expired_ack_deadlines_count, assuming that my AckCount matches my PullCount.

Actual behavior

We periodically see a huge rate of expired acks, as high as 15k/s. We are currently acknowledging 20k messages per second across 2 GCE instances (~10k/s per instance) and pulling 20k messages per second across those instances as well, so I wouldn't expect to see any meaningful number of expired acks.

I don't know the actual distribution of messages across the 40 goroutines, but if some of them are receiving most of the messages, then it's possible for ackIDBatchSize to be exceeded. When it's exceeded, sendModAck and sendAck both loop internally until all of the ackIDs have been sent. We don't have visibility into the distribution of time it takes to Acknowledge 2500 ackIDs, but we can see from the GCP console that the Acknowledge method has a 95th-percentile latency of over 100ms. Separately, we are calling ModifyAckDeadline (which takes 40ms at the 95th percentile) with 16k IDs per second, which requires 7 calls per instance and could therefore take more than 250ms.
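
The back-of-the-envelope math from that paragraph can be checked directly (a sketch; 2500 approximates the client's ackIDBatchSize, and 40ms is the observed p95 per ModifyAckDeadline call):

```go
package main

import "fmt"

// worstCaseFlush estimates sequential RPC time when one sender goroutine
// must split pending ackIDs into fixed-size batches and issue the calls
// back to back.
func worstCaseFlush(pendingIDs, batchSize, perCallMillis int) (calls, totalMillis int) {
	calls = (pendingIDs + batchSize - 1) / batchSize // ceiling division
	return calls, calls * perCallMillis
}

func main() {
	// ~16k ModifyAckDeadline IDs per second at a 40ms p95 per call.
	calls, ms := worstCaseFlush(16000, 2500, 40)
	fmt.Printf("%d calls, %dms sequential\n", calls, ms) // 7 calls, 280ms sequential
}
```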

Either of those would end up delaying the other, since there's only a single sender goroutine, which could be contributing to our expired-acks issue.
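
A hypothetical illustration of that head-of-line blocking (not the client's actual code): with a single sender goroutine issuing batches strictly one after another, an ack queued behind a run of modack batches waits for all of them to finish before it even starts.

```go
package main

import "fmt"

// batch is a simulated RPC the single sender goroutine must issue.
type batch struct {
	kind string // "ack" or "modack"
	ms   int    // simulated per-call latency in milliseconds
}

// startTimes returns when each queued batch begins, given that a single
// sender goroutine issues them sequentially.
func startTimes(queue []batch) []int {
	starts := make([]int, len(queue))
	elapsed := 0
	for i, b := range queue {
		starts[i] = elapsed
		elapsed += b.ms
	}
	return starts
}

func main() {
	// Seven 40ms modack batches queued ahead of one ack batch:
	// the ack cannot start until 7 * 40ms = 280ms have passed.
	queue := []batch{
		{"modack", 40}, {"modack", 40}, {"modack", 40}, {"modack", 40},
		{"modack", 40}, {"modack", 40}, {"modack", 40},
		{"ack", 100},
	}
	for i, t := range startTimes(queue) {
		fmt.Printf("%s starts at t=%dms\n", queue[i].kind, t)
	}
}
```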

Additionally, since we aren't using exactly-once delivery, there's no way for us to measure how long it takes from when we call Ack to when the request is actually sent to Pub/Sub. One way to fix that would be for the *AckResult returned from AckWithResult to become Ready once the message is sent, even when exactly-once delivery is disabled.
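
This is roughly how such a measurement could look if AckWithResult behaved that way (a sketch of the proposed behavior, not working code today: without exactly-once delivery the returned *AckResult is currently Ready immediately, so this would measure nothing useful):

```go
// Hypothetical: if the *AckResult became Ready only once the
// acknowledge request was actually sent, we could measure the
// Ack-call-to-send latency like this.
err := sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
	start := time.Now()
	res := msg.AckWithResult()
	go func() {
		// Get blocks until the result is Ready.
		if _, err := res.Get(context.Background()); err == nil {
			log.Printf("ack sent after %v", time.Since(start))
		}
	}()
})
```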

Screenshots

(screenshots omitted)

Something else that's interesting is that the latencies shown in the GCP Console do not match our application-level metrics (which measure from start of Receive callback to Ack/Nack function call) at all:
(GCP Console latency screenshot)
vs
(application-level latency screenshot)

This is what led us to investigate if there was some sort of delay between when we Ack a message and when the underlying Acknowledge is sent by the client.

ModAckCount
(ModAckCount screenshot)
(the reason for the increase at 19:30 UTC is because we increased MaxOutstandingMessages from 25000 to 40000)

Finally, the increase in expired acks happened after a sharp decrease in the StreamingPull response rate, for which I have no explanation unless some change was made on Pub/Sub's side. It's not clear whether this might mean there's a higher concentration of messages in individual goroutines.
(StreamingPull response rate screenshot)

Additional context

We don't have any visibility into the 99th-percentile modack extension being used, which would have been helpful in debugging.

Labels

api: pubsub (Issues related to the Pub/Sub API)
status: investigating (The issue is under investigation, which is determined to be non-trivial.)
