How to find out what is taking up disk space in ZooKeeper

If you landed on this page, ZooKeeper has probably started taking up a lot of disk space and you need to find out why.

ZooKeeper mostly stores ephemeral data, so sooner or later you ask yourself: why has the server started to consume so much disk space?

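The reason is that ZooKeeper periodically writes snapshots of its data, together with transaction logs, to disk. Unless automatic purging is configured, old ones are not removed, so over time more and more of them accumulate.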
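
To confirm that snapshots and transaction logs are really what fills the disk, it can help to look at the data directory first. Here is a minimal sketch, assuming the default on-disk layout in which ZooKeeper keeps snapshot.* and log.* files in a version-2 subdirectory of the data directory. The function name zk_usage is made up for this example.

```shell
#!/bin/sh
# Count snapshot and transaction log files and show the space they occupy.
# Assumption: files live in <dataDir>/version-2 (the default ZooKeeper layout).
zk_usage() {
    dir="$1/version-2"
    if [ ! -d "$dir" ]; then
        echo "no version-2 directory under $1" >&2
        return 1
    fi
    snapshots=$(find "$dir" -maxdepth 1 -type f -name 'snapshot.*' | wc -l)
    logs=$(find "$dir" -maxdepth 1 -type f -name 'log.*' | wc -l)
    # $(( )) strips the padding that some wc implementations add
    echo "snapshots: $((snapshots))"
    echo "transaction logs: $((logs))"
    # Total size of the directory in human-readable form
    du -sh "$dir"
}
```

Call it with your own data directory, for example: zk_usage "/path/to/zookeeper/data". If transaction logs are configured to go to a separate directory, run it for that directory as well.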

You can delete some of the old snapshots and clear disk space. There is a zkCleanup.sh script for this: https://github.com/apache/zookeeper/blob/master/bin/zkCleanup.sh

Important note: the script must be run from the folder where it is installed. You can find the folder using this command:

locate zkCleanup.sh

For Cloudera, this folder will look like this:

/opt/cloudera/parcels/CDH-<version>/lib/zookeeper/bin
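
If locate is not installed or its index is out of date, find can do the same search, only slower. A small sketch follows; the function name find_zk_cleanup and the default directories /opt and /usr are assumptions of this example, so pass your own directories if the distribution lives elsewhere.

```shell
#!/bin/sh
# Search the given directories for zkCleanup.sh.
# With no arguments, /opt and /usr are searched (an arbitrary default for this sketch).
find_zk_cleanup() {
    if [ "$#" -eq 0 ]; then
        set -- /opt /usr
    fi
    # Errors such as "Permission denied" are hidden to keep the output readable
    find "$@" -type f -name zkCleanup.sh 2>/dev/null
}
```

For example: find_zk_cleanup /opt/cloudera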

If you don’t have such a file, it’s not a problem. Let’s look at the main part of the script. There is a line:

org.apache.zookeeper.server.PurgeTxnLog "$ZOODATADIR" $*

Replace $* with the command-line arguments. According to the PurgeTxnLog help, you need to pass the -n parameter, which is the number of snapshots to keep (it must be greater than or equal to 3):

org.apache.zookeeper.server.PurgeTxnLog "/path/to/zookeeper/data" -n 3

The class must be run using Java, specifying the classpath:

java -cp "$CLASSPATH" org.apache.zookeeper.server.PurgeTxnLog "/path/to/zookeeper/data" -n 3

To set the CLASSPATH variable, source the zkEnv.sh script from the distribution's bin folder:

ZOOBINDIR="<zookeeper_distr>/bin"
. "$ZOOBINDIR"/zkEnv.sh
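
Put together, the steps above might look like the sketch below. The function name purge_zk and the DRY_RUN switch are inventions of this example, not part of ZooKeeper. It assumes that zkEnv.sh sets CLASSPATH as described above and that java is on the PATH.

```shell
#!/bin/sh
# Purge old ZooKeeper snapshots by calling PurgeTxnLog directly.
# Arguments: <bin dir of the distribution> <data dir> [snapshots to keep, default 3]
purge_zk() {
    zoobindir="$1"
    datadir="$2"
    keep="${3:-3}"

    # The count must be a number and not less than 3
    case "$keep" in
        ''|*[!0-9]*) echo "snapshot count must be a number" >&2; return 1 ;;
    esac
    if [ "$keep" -lt 3 ]; then
        echo "snapshot count must be >= 3" >&2
        return 1
    fi

    # With DRY_RUN=1 only print the command; zkEnv.sh is not sourced in this mode
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "java -cp \"\$CLASSPATH\" org.apache.zookeeper.server.PurgeTxnLog \"$datadir\" -n $keep"
        return 0
    fi

    # zkEnv.sh expects ZOOBINDIR and sets CLASSPATH
    ZOOBINDIR="$zoobindir"
    . "$ZOOBINDIR"/zkEnv.sh
    java -cp "$CLASSPATH" org.apache.zookeeper.server.PurgeTxnLog "$datadir" -n "$keep"
}
```

A cautious first run could be: DRY_RUN=1 purge_zk "<zookeeper_distr>/bin" "/path/to/zookeeper/data" 3. It prints the command without deleting anything. Drop DRY_RUN=1 to actually purge.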

If ZooKeeper comes bundled with Kafka, the command looks like this:

<zookeeper_distr>/bin/kafka-run-class org.apache.zookeeper.server.PurgeTxnLog "<zookeeper_distr>/data/zookeeper-data" -n 3

If you still have any questions, feel free to ask me in the comments under this article or write me at promark33@gmail.com.

