In what situation are you experiencing subpar performance?
I’m running a self-hosted PostHog hobby deployment on EC2 using the Docker Compose setup.
The disk usage is growing quickly, and after inspecting Docker volumes, Redpanda/Kafka seems to be the main source of disk usage, especially the events_plugin_ingestion topic.
The largest volumes/topics I found were:

```
app_redpanda-data      61G
app_zookeeper-datalog  21G
app_clickhouse-data    8.1G
app_objectstorage      1.6G
app_postgres-data      146M
```

Redpanda topic disk usage:

```
/var/lib/docker/volumes/app_redpanda-data/_data/kafka/events_plugin_ingestion = 54G
/var/lib/docker/volumes/app_redpanda-data/_data/kafka/clickhouse_events_json = 4.7G
/var/lib/docker/volumes/app_redpanda-data/_data/kafka = 60G
```
After reading the ingestion pipeline docs, my understanding is that the flow is roughly:

```
capture
  -> events_plugin_ingestion
  -> CDP / plugin ingestion worker
  -> person processing / event processing
  -> clickhouse_events_json
  -> ClickHouse
```
It looks like the bottleneck is before clickhouse_events_json, likely inside the plugins / CDP ingestion worker path.
The clickhouse-ingestion consumer group is lagging heavily on events_plugin_ingestion.
Earlier:

```
GROUP:          clickhouse-ingestion
TOPIC:          events_plugin_ingestion
CURRENT-OFFSET: 7299987
LOG-END-OFFSET: 16199449
TOTAL-LAG:      8899462
```

Later:

```
GROUP:          clickhouse-ingestion
TOPIC:          events_plugin_ingestion
CURRENT-OFFSET: 7403487
LOG-END-OFFSET: 17240198
TOTAL-LAG:      9836711
```
So the lag is increasing over time.
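To quantify the gap, the deltas between the two snapshots work out as follows (plain arithmetic on the offsets above; I did not record the exact wall-clock interval, so these are relative rates only):

```bash
# Messages produced between the two snapshots (log-end-offset delta)
echo $(( 17240198 - 16199449 ))   # 1040749
# Messages consumed between the two snapshots (current-offset delta)
echo $(( 7403487 - 7299987 ))     # 103500
# Roughly 10x more messages arrived than were consumed over the same window,
# which is consistent with the growing lag.
```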
The consumer host maps to the plugins container:
8910a783475c -> app-plugins-1
`app-plugins-1` stats:

```
CPU:       ~3%
RAM:       ~621MiB / 15.3GiB
NET I/O:   639GB / 528GB
BLOCK I/O: 7.21GB / 179MB
PIDs:      190
```

Other relevant containers were not obviously CPU/RAM saturated:

```
app-db-1          CPU ~4%   RAM ~93MiB
app-clickhouse-1  CPU ~8%   RAM ~1.48GiB
app-kafka-1       CPU ~9%   RAM ~1.8GiB
app-plugins-1     CPU ~3%   RAM ~621MiB
```
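The numbers above are a one-off snapshot; I collected them with something along these lines:

```bash
# One-shot (non-streaming) snapshot of container resource usage
docker stats --no-stream app-plugins-1 app-kafka-1 app-clickhouse-1 app-db-1
```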
`app-plugins-1` logs show slow batches:

```
Slow batch: Processed 500 events in 12.61s, groupId: clickhouse-ingestion
Slow batch: Processed 1 events in 11.04s, groupId: cdp-person-updates-consumer
```

I suspect person processing may be involved because of the `cdp-person-updates-consumer` slow batch logs, but I'm not sure.
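Back-of-the-envelope, and assuming the 500-event batch above is representative, the implied per-consumer throughput is nowhere near what would be needed to drain the backlog:

```bash
# ~500 events per 12.61s batch is roughly 40 events/s for the single consumer
awk 'BEGIN { printf "%.1f events/s\n", 500 / 12.61 }'
# At ~40 events/s, draining a ~9.8M message backlog would take on the order of
# 68 hours, ignoring events that keep arriving in the meantime
awk 'BEGIN { printf "%.1f hours\n", 9836711 / 40 / 3600 }'
```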
The Redpanda topic configuration is:

```
events_plugin_ingestion:
  partitions:      1
  replicas:        1
  retention.ms:    604800000
  retention.bytes: -1
  segment.bytes:   134217728

clickhouse_events_json:
  partitions:      1
  replicas:        1
  retention.ms:    604800000
  retention.bytes: -1
```
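For reference, `retention.ms=604800000` is 7 days and `retention.bytes=-1` means there is no size-based cap. If capping retention turns out to be part of the supported answer, I assume it would be applied roughly like this (not run yet; the 10 GiB value is only an example, and one of my questions below is whether size-based retention could delete segments the lagging consumer has not read):

```bash
# Example only (not applied): cap events_plugin_ingestion at ~10 GiB
docker exec app-kafka-1 rpk topic alter-config events_plugin_ingestion \
  --set retention.bytes=10737418240 --brokers localhost:9092
```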
My main questions are:
- Is `events_plugin_ingestion` intentionally created with only 1 partition in the hobby Docker Compose deployment?
- Is scaling the `plugins` service expected to improve throughput for `clickhouse-ingestion`, or is it limited by the single partition?
- Is it safe/recommended to increase partitions for `events_plugin_ingestion` in a self-hosted hobby deployment?
- Are there supported environment variables or configuration options to increase CDP/plugin ingestion throughput?
- Are there recommended Redpanda retention settings for hobby deployments? The current topic has 7-day retention and `retention.bytes=-1`.
- If I reduce `retention.ms` or set `retention.bytes`, can unprocessed events be dropped while `clickhouse-ingestion` lag is high?
- What is the recommended way to recover from a large backlog like ~10M messages?
- Could this backlog affect batch exports because recent events have not reached ClickHouse yet?
Any guidance on the supported way to scale this ingestion worker or tune this setup would be appreciated.
How to reproduce
- Run a self-hosted PostHog hobby Docker Compose deployment with sustained event ingestion.
- Inspect Redpanda/Kafka disk usage:

  ```
  sudo du -xh --max-depth=2 /var/lib/docker/volumes/app_redpanda-data/_data/kafka | sort -h | tail -40
  ```

- Check consumer lag:

  ```
  docker exec app-kafka-1 rpk group describe clickhouse-ingestion --brokers localhost:9092
  ```

- Check topic configuration:

  ```
  docker exec app-kafka-1 rpk topic describe events_plugin_ingestion --brokers localhost:9092
  docker exec app-kafka-1 rpk topic describe clickhouse_events_json --brokers localhost:9092
  ```

- Map the consumer host to the Docker container:

  ```
  docker inspect -f '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q)
  ```

- Check `app-plugins-1` logs:

  ```
  docker logs --tail=3000 app-plugins-1 | grep -Ei 'slow batch|person|transform|hog|destination|clickhouse|error|timeout|failed|exception'
  ```
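- Optionally, re-check consumer lag periodically to confirm it keeps growing (the interval here is arbitrary):

  ```bash
  watch -n 300 "docker exec app-kafka-1 rpk group describe clickhouse-ingestion --brokers localhost:9092"
  ```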
Environment
Additional context
I’m trying to understand whether this is an expected limitation of the hobby deployment, a configuration issue, or a possible ingestion throughput bug/regression.
The key concern is that `events_plugin_ingestion` has only one partition, so adding another `plugins` worker may not help: within the `clickhouse-ingestion` consumer group, only one consumer can be assigned to that single partition, leaving any extra worker idle for this topic.
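For reference, the scaling experiment I had in mind is simply adding a second `plugins` replica with Compose (a sketch only; I'm assuming the service is named `plugins` based on the `app-plugins-1` container name). With a single partition I would expect the second consumer to receive no assignment for this topic:

```bash
# Sketch: add a second plugins worker (assumes the Compose service is named "plugins")
docker compose up -d --scale plugins=2
```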