If you need to provide your own OutputCommitter implementation for Spark Parquet output tasks, the first step is to create a class that extends org.apache.hadoop.mapreduce.OutputCommitter:
public class YourOutputCommitter extends OutputCommitter {
    ...
}
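The OutputCommitter base class is abstract, so at a minimum your subclass has to implement its five abstract methods. Below is a bare-bones sketch of what that looks like; the class name is a placeholder and the comments only describe what each method is typically responsible for, not a concrete implementation:

import java.io.IOException;

import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class YourOutputCommitter extends OutputCommitter {

    @Override
    public void setupJob(JobContext jobContext) throws IOException {
        // prepare the job output location, e.g. create a temporary/staging directory
    }

    @Override
    public void setupTask(TaskAttemptContext taskContext) throws IOException {
        // prepare the working directory for this task attempt
    }

    @Override
    public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
        // return true if this task attempt produced output that has to be committed
        return true;
    }

    @Override
    public void commitTask(TaskAttemptContext taskContext) throws IOException {
        // promote the task attempt's output to the final job output location
    }

    @Override
    public void abortTask(TaskAttemptContext taskContext) throws IOException {
        // clean up the task attempt's output after a failure
    }
}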
Next, regardless of how your OutputCommitter is implemented, you need to register its fully qualified class name in the Hadoop (!) configuration of Spark, like this:
javaSparkContext.hadoopConfiguration().set("spark.sql.parquet.output.committer.class", YourOutputCommitter.class.getCanonicalName());
The equivalent in Scala:
session.sparkContext.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "your.package.YourOutputCommitter")
A complete example of creating the JavaSparkContext:
private JavaSparkContext createJavaSparkContext() {
    SparkConf conf = new SparkConf()
            .setAppName("your-app-name")
            .set("your.other.spark.parameters", "your.other.spark.values");
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
    javaSparkContext.setCheckpointDir("/tmp");
    // Register the custom committer on the Hadoop configuration
    javaSparkContext.hadoopConfiguration().set("spark.sql.parquet.output.committer.class", YourOutputCommitter.class.getCanonicalName());
    return javaSparkContext;
}
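Once the committer class is registered on the Hadoop configuration, any Parquet write performed through that context should pick it up. The following is only an illustrative sketch of such a write; the SparkSession setup and the input/output paths are assumptions, not part of the original example:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetWriteExample {
    public static void main(String[] args) {
        // getOrCreate() reuses an already-running SparkContext (and its Hadoop
        // configuration) if one exists, e.g. the one built by createJavaSparkContext()
        SparkSession session = SparkSession.builder()
                .appName("your-app-name")
                .getOrCreate();

        // The committer can also be registered here, as in the Scala example above
        session.sparkContext().hadoopConfiguration()
                .set("spark.sql.parquet.output.committer.class", "your.package.YourOutputCommitter");

        // Illustrative paths: this Parquet write is committed via YourOutputCommitter
        Dataset<Row> df = session.read().json("/path/to/input.json");
        df.write().parquet("/path/to/output");
    }
}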
If you still have any questions, feel free to ask in the comments under this article, or write to me at promark33@gmail.com.