events_plugin_ingestion backlog growing and redpanda filling disk #57573

@oddgarden6465

Description

In what situation are you experiencing subpar performance?

I’m running a self-hosted PostHog hobby deployment on EC2 using the Docker Compose setup.

The disk usage is growing quickly, and after inspecting Docker volumes, Redpanda/Kafka seems to be the main source of disk usage, especially the events_plugin_ingestion topic.

The largest volumes/topics I found were:

app_redpanda-data       61G
app_zookeeper-datalog   21G
app_clickhouse-data     8.1G
app_objectstorage       1.6G
app_postgres-data       146M

Redpanda topic disk usage:

/var/lib/docker/volumes/app_redpanda-data/_data/kafka/events_plugin_ingestion = 54G
/var/lib/docker/volumes/app_redpanda-data/_data/kafka/clickhouse_events_json = 4.7G
/var/lib/docker/volumes/app_redpanda-data/_data/kafka = 60G

After reading the ingestion pipeline docs, my understanding is that the flow is roughly:

capture
  -> events_plugin_ingestion
  -> CDP / plugin ingestion worker
  -> person processing / event processing
  -> clickhouse_events_json
  -> ClickHouse

It looks like the bottleneck is before clickhouse_events_json, likely inside the plugins / CDP ingestion worker path.

The clickhouse-ingestion consumer group is lagging heavily on events_plugin_ingestion.

Earlier:

GROUP: clickhouse-ingestion
TOPIC: events_plugin_ingestion
CURRENT-OFFSET: 7299987
LOG-END-OFFSET: 16199449
TOTAL-LAG: 8899462

Later:

GROUP: clickhouse-ingestion
TOPIC: events_plugin_ingestion
CURRENT-OFFSET: 7403487
LOG-END-OFFSET: 17240198
TOTAL-LAG: 9836711
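A quick arithmetic check on the two snapshots above (offsets copied verbatim from the rpk output) shows how far consumption is falling behind production:

```shell
# Offsets taken from the two "rpk group describe" snapshots above
earlier_current=7299987; earlier_end=16199449
later_current=7403487;   later_end=17240198

consumed=$((later_current - earlier_current))   # messages the consumer got through
produced=$((later_end - earlier_end))           # messages appended to the topic
lag_growth=$((produced - consumed))             # net backlog growth between snapshots

echo "consumed=$consumed produced=$produced lag_growth=$lag_growth"
# → consumed=103500 produced=1040749 lag_growth=937249
```

Between the snapshots the consumer handled roughly 103k messages while just over 1M arrived, i.e. it is keeping up with less than 10% of the incoming rate.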

So the lag is increasing over time: between the two snapshots the consumer advanced ~104k offsets while ~1.04M new messages were appended.

The consumer host maps to the plugins container:

8910a783475c -> app-plugins-1

app-plugins-1 stats:

CPU: ~3%
RAM: ~621MiB / 15.3GiB
NET I/O: 639GB / 528GB
BLOCK I/O: 7.21GB / 179MB
PIDs: 190

Other relevant containers were not obviously CPU/RAM saturated:

app-db-1          CPU ~4%, RAM ~93MiB
app-clickhouse-1  CPU ~8%, RAM ~1.48GiB
app-kafka-1       CPU ~9%, RAM ~1.8GiB
app-plugins-1     CPU ~3%, RAM ~621MiB

app-plugins-1 logs show slow batches:

Slow batch: Processed 500 events in 12.61s, groupId: clickhouse-ingestion
Slow batch: Processed 1 events in 11.04s, groupId: cdp-person-updates-consumer

I suspect person processing may be involved because of the cdp-person-updates-consumer slow batch logs, but I’m not sure.

The Redpanda topic configuration is:

events_plugin_ingestion:
  partitions: 1
  replicas: 1
  retention.ms: 604800000
  retention.bytes: -1
  segment.bytes: 134217728

clickhouse_events_json:
  partitions: 1
  replicas: 1
  retention.ms: 604800000
  retention.bytes: -1
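If disk pressure becomes urgent, topic retention can be tightened at runtime; `rpk topic alter-config` is the standard mechanism. Note, though, that retention in Kafka/Redpanda is purely time/size based and will delete old segments regardless of consumer position, so with the current lag this could drop unconsumed events (this is exactly question 6 below). The 32 GiB / 2-day figures here are arbitrary examples, not recommendations:

```shell
# Example only: cap events_plugin_ingestion at ~32 GiB and 2 days.
# WARNING: retention deletes old segments regardless of consumer lag,
# so unconsumed events can be lost if the backlog exceeds these limits.
docker exec app-kafka-1 rpk topic alter-config events_plugin_ingestion \
  --set retention.bytes=34359738368 \
  --set retention.ms=172800000 \
  --brokers localhost:9092
```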

My main questions are:

  1. Is events_plugin_ingestion intentionally created with only 1 partition in the hobby Docker Compose deployment?
  2. Is scaling the plugins service expected to improve throughput for clickhouse-ingestion, or is it limited by the single partition?
  3. Is it safe/recommended to increase partitions for events_plugin_ingestion in a self-hosted hobby deployment?
  4. Are there supported environment variables or configuration options to increase CDP/plugin ingestion throughput?
  5. Are there recommended Redpanda retention settings for hobby deployments? The current topic has 7-day retention and retention.bytes=-1.
  6. If I reduce retention.ms or set retention.bytes, can unprocessed events be dropped while clickhouse-ingestion lag is high?
  7. What is the recommended way to recover from a large backlog like ~10M messages?
  8. Could this backlog affect batch exports because recent events have not reached ClickHouse yet?

Any guidance on the supported way to scale this ingestion worker or tune this setup would be appreciated.
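On question 3 specifically: Kafka/Redpanda does allow adding partitions to an existing topic, and rpk exposes this directly. Whether PostHog's hobby ingestion path handles a repartitioned events_plugin_ingestion gracefully is exactly what I'd like confirmed, so treat this as a sketch of the mechanism, not something I've validated:

```shell
# Example only: grow events_plugin_ingestion from 1 to 4 partitions.
# Existing messages stay on partition 0; only new produce traffic
# spreads across the added partitions, and key-to-partition mapping
# changes (events for the same key may land on a different partition).
docker exec app-kafka-1 rpk topic add-partitions events_plugin_ingestion \
  --num 3 \
  --brokers localhost:9092
```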

How to reproduce

  1. Run a self-hosted PostHog hobby Docker Compose deployment with sustained event ingestion.
  2. Inspect Redpanda/Kafka disk usage:
sudo du -xh --max-depth=2 /var/lib/docker/volumes/app_redpanda-data/_data/kafka | sort -h | tail -40
  3. Check consumer lag:
docker exec app-kafka-1 rpk group describe clickhouse-ingestion --brokers localhost:9092
  4. Check topic configuration:
docker exec app-kafka-1 rpk topic describe events_plugin_ingestion --brokers localhost:9092
docker exec app-kafka-1 rpk topic describe clickhouse_events_json --brokers localhost:9092
  5. Map the consumer host to the Docker container:
docker inspect -f '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q)
  6. Check app-plugins-1 logs:
docker logs --tail=3000 app-plugins-1 | grep -Ei 'slow batch|person|transform|hog|destination|clickhouse|error|timeout|failed|exception'

Environment

  • PostHog self-hosted hobby deployment

Additional context

I’m trying to understand whether this is an expected limitation of the hobby deployment, a configuration issue, or a possible ingestion throughput bug/regression.

The key concern is that events_plugin_ingestion has only one partition, so adding another plugins worker may not help: a consumer group can have at most one active consumer per partition.

Metadata

Labels: performance (Has to do with performance. For PRs, runs the clickhouse query performance suite)