Spark failed to connect to the MetaStore Server

The problem: you may encounter errors like this when running a Spark script or application: Solutions: if you do not need the MetaStore server, there are two ways to disable it. Please note that Spark 2.x or later is required. The first way (via spark2-submit parameters). The second way (via SparkConf object). Java example: Scala […]

Spark custom Parquet OutputCommitter

If you need your own OutputCommitter implementation for Spark Parquet output tasks, first create a class that extends org.apache.hadoop.mapreduce.OutputCommitter: Further, regardless of the OutputCommitter implementation, you need to register the full class name in the Hadoop (!) configuration of Spark like this: Example for Scala: […]

Spark concurrent write to same HDFS path

The problem: sometimes several Spark tasks need to write data to the same HDFS path. During the execution of these tasks, you may encounter some errors: Suppose we have one Spark task that writes to the hdfs://data/test directory. At runtime, Spark will create a temporary directory: hdfs://data/test/_temporary/0. There is […]
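
To make the shared-directory hazard more concrete, here is a minimal sketch that simulates two writers staging files under the same `_temporary/0` directory. It uses only `java.nio.file` on the local filesystem, not Spark or HDFS. The class and method names are hypothetical, and the cleanup step (deleting the whole `_temporary` tree when one job commits) is an assumption modeled on how Hadoop's default file committer is commonly described; the excerpt above does not state it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical local-filesystem simulation; not Spark or Hadoop code.
public class SharedTemporaryDirDemo {

    // Stage a file under <output>/_temporary/0, the directory every simulated job shares.
    static Path stage(Path output, String fileName) throws IOException {
        Path attemptDir = output.resolve("_temporary").resolve("0");
        Files.createDirectories(attemptDir);
        return Files.write(attemptDir.resolve(fileName),
                "data".getBytes(StandardCharsets.UTF_8));
    }

    // Move one staged file into the output directory, then remove the whole
    // _temporary tree (assumed cleanup behavior, see the note above).
    static void commit(Path output, Path staged) throws IOException {
        Files.move(staged, output.resolve(staged.getFileName()));
        deleteRecursively(output.resolve("_temporary"));
    }

    // Delete a directory tree, children before parents.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        Path output = Files.createTempDirectory("test");
        Path stagedA = stage(output, "part-a"); // first job stages its file
        Path stagedB = stage(output, "part-b"); // second job stages into the same directory
        commit(output, stagedA);                // first job finishes and cleans up
        System.out.println("part-a committed: " + Files.exists(output.resolve("part-a")));
        System.out.println("part-b still staged: " + Files.exists(stagedB));
    }
}
```

In this simulation the first job's commit removes the second job's staged file along with the shared directory, which is the kind of interference that can surface as missing-file errors when jobs share an output path.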

Scala + Maven + IntelliJ IDEA project setup

To create a Scala project in IntelliJ IDEA that uses Maven to manage dependencies, you first need to install the Scala plugin: go to File -> Preferences (Settings) -> Plugins, search for Scala in the Marketplace, then install it and restart IntelliJ IDEA. Next, let’s create a regular Java + Maven project: File -> New -> Project, select Maven -> […]