Spark custom Parquet OutputCommitter

If you need to provide your own OutputCommitter implementation for Spark Parquet output tasks, the first step is to create a class that extends org.apache.hadoop.mapreduce.OutputCommitter:

public class YourOutputCommitter extends OutputCommitter {
	...
}
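For reference, a minimal compilable skeleton might look like this (the class name is a placeholder, and the empty bodies are stubs, not a working commit protocol). org.apache.hadoop.mapreduce.OutputCommitter declares five abstract methods that every subclass must implement:

import java.io.IOException;

import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class YourOutputCommitter extends OutputCommitter {

    @Override
    public void setupJob(JobContext jobContext) throws IOException {
        // runs once before the job starts writing output
    }

    @Override
    public void setupTask(TaskAttemptContext taskContext) throws IOException {
        // runs on the executor before a task attempt starts writing
    }

    @Override
    public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
        // return true if this task attempt produced output that should be committed
        return true;
    }

    @Override
    public void commitTask(TaskAttemptContext taskContext) throws IOException {
        // promote the task attempt's output, e.g. move temporary files to their final location
    }

    @Override
    public void abortTask(TaskAttemptContext taskContext) throws IOException {
        // clean up the task attempt's output after a failure
    }
}

Note that, depending on your Spark version, the Parquet data source may also expect the committer to be a subclass of org.apache.parquet.hadoop.ParquetOutputCommitter (which itself extends the Hadoop base class), so check the documentation of spark.sql.parquet.output.committer.class for the version you run.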

Next, regardless of how the OutputCommitter is implemented, you need to register its fully qualified class name in the Hadoop (!) configuration of Spark, like this:

javaSparkContext.hadoopConfiguration().set("spark.sql.parquet.output.committer.class", YourOutputCommitter.class.getCanonicalName());

The same setting in Scala:

session.sparkContext.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "your.package.YourOutputCommitter")
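Depending on your Spark version, you may also be able to set the option when building the session, since Spark SQL typically copies spark.sql.* settings into the Hadoop configuration it uses for writes. This variant is an assumption on my part, so verify it against your version:

import org.apache.spark.sql.SparkSession;

// assumption: spark.sql.* session options are propagated to the Hadoop
// configuration used by the Parquet writer in your Spark version
SparkSession session = SparkSession.builder()
        .appName("your-app-name")
        .config("spark.sql.parquet.output.committer.class", "your.package.YourOutputCommitter")
        .getOrCreate();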

An example of the complete JavaSparkContext creation process:

private JavaSparkContext createJavaSparkContext() {
    SparkConf conf = new SparkConf()
            .setAppName("your-app-name")
            .set("your.other.spark.parameters", "your.other.spark.values");
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
    javaSparkContext.setCheckpointDir("/tmp");
    // register the custom committer in the Hadoop configuration, not the SparkConf
    javaSparkContext.hadoopConfiguration().set("spark.sql.parquet.output.committer.class", YourOutputCommitter.class.getCanonicalName());
    return javaSparkContext;
}
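The committer is only invoked when Parquet output is actually written. A brief sketch of triggering it through a SparkSession on top of the same context (the output path is just an example):

import org.apache.spark.sql.SparkSession;

JavaSparkContext javaSparkContext = createJavaSparkContext();
// getOrCreate() reuses the already-running SparkContext, so the committer
// registered in its Hadoop configuration is picked up for this write
SparkSession session = SparkSession.builder().getOrCreate();
session.range(10).write().parquet("/tmp/parquet-committer-demo");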

If you still have any questions, feel free to ask me in the comments under this article or write me at promark33@gmail.com.

If I saved your day, you can support me 🤝
