events_plugin_ingestion backlog growing and redpanda filling disk #57573

@oddgarden6465

Description

In what situation are you experiencing subpar performance?

I’m running a self-hosted PostHog hobby deployment on EC2 using the Docker Compose setup.

The disk usage is growing quickly, and after inspecting Docker volumes, Redpanda/Kafka seems to be the main source of disk usage, especially the events_plugin_ingestion topic.

The largest volumes/topics I found were:

app_redpanda-data       61G
app_zookeeper-datalog   21G
app_clickhouse-data     8.1G
app_objectstorage       1.6G
app_postgres-data       146M

Redpanda topic disk usage:

/var/lib/docker/volumes/app_redpanda-data/_data/kafka/events_plugin_ingestion = 54G
/var/lib/docker/volumes/app_redpanda-data/_data/kafka/clickhouse_events_json = 4.7G
/var/lib/docker/volumes/app_redpanda-data/_data/kafka = 60G

After reading the ingestion pipeline docs, my understanding is that the flow is roughly:

capture
  -> events_plugin_ingestion
  -> CDP / plugin ingestion worker
  -> person processing / event processing
  -> clickhouse_events_json
  -> ClickHouse

It looks like the bottleneck is before clickhouse_events_json, likely inside the plugins / CDP ingestion worker path.

The clickhouse-ingestion consumer group is lagging heavily on events_plugin_ingestion.

Earlier:

GROUP: clickhouse-ingestion
TOPIC: events_plugin_ingestion
CURRENT-OFFSET: 7299987
LOG-END-OFFSET: 16199449
TOTAL-LAG: 8899462

Later:

GROUP: clickhouse-ingestion
TOPIC: events_plugin_ingestion
CURRENT-OFFSET: 7403487
LOG-END-OFFSET: 17240198
TOTAL-LAG: 9836711
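A quick arithmetic check on the two snapshots above (offsets copied verbatim from the rpk output) shows how far consumption is falling behind production:

```shell
# Offsets taken from the two "rpk group describe" snapshots above
earlier_current=7299987; earlier_end=16199449
later_current=7403487;   later_end=17240198

consumed=$((later_current - earlier_current))   # messages the consumer got through
produced=$((later_end - earlier_end))           # messages appended to the topic
lag_growth=$((produced - consumed))             # net backlog growth between snapshots

echo "consumed=$consumed produced=$produced lag_growth=$lag_growth"
# → consumed=103500 produced=1040749 lag_growth=937249
```

Between the snapshots the consumer handled roughly 103k messages while just over 1M arrived, i.e. it is keeping up with less than 10% of the incoming rate.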

So the lag is increasing over time: between the two snapshots the consumer advanced ~104k offsets while ~1.04M new messages were appended.

The consumer host maps to the plugins container:

8910a783475c -> app-plugins-1

app-plugins-1 stats:

CPU: ~3%
RAM: ~621MiB / 15.3GiB
NET I/O: 639GB / 528GB
BLOCK I/O: 7.21GB / 179MB
PIDs: 190

Other relevant containers were not obviously CPU/RAM saturated:

app-db-1          CPU ~4%, RAM ~93MiB
app-clickhouse-1  CPU ~8%, RAM ~1.48GiB
app-kafka-1       CPU ~9%, RAM ~1.8GiB
app-plugins-1     CPU ~3%, RAM ~621MiB

app-plugins-1 logs show slow batches:

Slow batch: Processed 500 events in 12.61s, groupId: clickhouse-ingestion
Slow batch: Processed 1 events in 11.04s, groupId: cdp-person-updates-consumer

I suspect person processing may be involved because of the cdp-person-updates-consumer slow batch logs, but I’m not sure.

The Redpanda topic configuration is:

events_plugin_ingestion:
  partitions: 1
  replicas: 1
  retention.ms: 604800000
  retention.bytes: -1
  segment.bytes: 134217728

clickhouse_events_json:
  partitions: 1
  replicas: 1
  retention.ms: 604800000
  retention.bytes: -1
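If disk pressure becomes urgent, topic retention can be tightened at runtime; `rpk topic alter-config` is the standard mechanism. Note, though, that retention in Kafka/Redpanda is purely time/size based and will delete old segments regardless of consumer position, so with the current lag this could drop unconsumed events (this is exactly question 6 below). The 32 GiB / 2-day figures here are arbitrary examples, not recommendations:

```shell
# Example only: cap events_plugin_ingestion at ~32 GiB and 2 days.
# WARNING: retention deletes old segments regardless of consumer lag,
# so unconsumed events can be lost if the backlog exceeds these limits.
docker exec app-kafka-1 rpk topic alter-config events_plugin_ingestion \
  --set retention.bytes=34359738368 \
  --set retention.ms=172800000 \
  --brokers localhost:9092
```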

My main questions are:

  1. Is events_plugin_ingestion intentionally created with only 1 partition in the hobby Docker Compose deployment?
  2. Is scaling the plugins service expected to improve throughput for clickhouse-ingestion, or is it limited by the single partition?
  3. Is it safe/recommended to increase partitions for events_plugin_ingestion in a self-hosted hobby deployment?
  4. Are there supported environment variables or configuration options to increase CDP/plugin ingestion throughput?
  5. Are there recommended Redpanda retention settings for hobby deployments? The current topic has 7-day retention and retention.bytes=-1.
  6. If I reduce retention.ms or set retention.bytes, can unprocessed events be dropped while clickhouse-ingestion lag is high?
  7. What is the recommended way to recover from a large backlog like ~10M messages?
  8. Could this backlog affect batch exports because recent events have not reached ClickHouse yet?

Any guidance on the supported way to scale this ingestion worker or tune this setup would be appreciated.
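On question 3 specifically: Kafka/Redpanda does allow adding partitions to an existing topic, and rpk exposes this directly. Whether PostHog's hobby ingestion path handles a repartitioned events_plugin_ingestion gracefully is exactly what I'd like confirmed, so treat this as a sketch of the mechanism, not something I've validated:

```shell
# Example only: grow events_plugin_ingestion from 1 to 4 partitions.
# Existing messages stay on partition 0; only new produce traffic
# spreads across the added partitions, and key-to-partition mapping
# changes (events for the same key may land on a different partition).
docker exec app-kafka-1 rpk topic add-partitions events_plugin_ingestion \
  --num 3 \
  --brokers localhost:9092
```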

How to reproduce

  1. Run a self-hosted PostHog hobby Docker Compose deployment with sustained event ingestion.
  2. Inspect Redpanda/Kafka disk usage:
sudo du -xh --max-depth=2 /var/lib/docker/volumes/app_redpanda-data/_data/kafka | sort -h | tail -40
  3. Check consumer lag:
docker exec app-kafka-1 rpk group describe clickhouse-ingestion --brokers localhost:9092
  4. Check topic configuration:
docker exec app-kafka-1 rpk topic describe events_plugin_ingestion --brokers localhost:9092
docker exec app-kafka-1 rpk topic describe clickhouse_events_json --brokers localhost:9092
  5. Map the consumer host to the Docker container:
docker inspect -f '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q)
  6. Check app-plugins-1 logs:
docker logs --tail=3000 app-plugins-1 | grep -Ei 'slow batch|person|transform|hog|destination|clickhouse|error|timeout|failed|exception'

Environment

  • PostHog self-hosted hobby deployment

Additional context

I’m trying to understand whether this is an expected limitation of the hobby deployment, a configuration issue, or a possible ingestion throughput bug/regression.

The key concern is that events_plugin_ingestion has only one partition, so adding another plugins worker may not help: a consumer group can have at most one active consumer per partition.

Metadata

Labels: performance (Has to do with performance. For PRs, runs the clickhouse query performance suite)