Spark Java access remote HDFS

Suppose we need to work with a different HDFS cluster (clusterB, for instance) from a Spark Java application running on clusterA.

First, you need to add a --conf key to your run command. The property name depends on your Spark version:

  • (Spark 1.x-2.1) spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB
  • (Spark 2.2+) spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB
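For example, a run command for Spark 2.2+ might look like this (a sketch; the master, class name, jar, and cluster URIs are placeholders for your own values):

```shell
# Hypothetical spark-submit invocation; replace the class, jar,
# and HDFS URIs with your own values.
spark-submit \
  --master yarn \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB \
  --class com.example.MyApp \
  my-app.jar
```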

Second, when you create Spark’s Java context, add the following:

javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));

You need to go to clusterB, gather core-site.xml and hdfs-site.xml from there (the default location on Cloudera is /etc/hadoop/conf), and put them next to your app running on clusterA.
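Putting the two steps together, the driver code might look like this (a sketch; the app name, the config file names, and the input path on clusterB are assumptions for illustration):

```java
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteHdfsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("remote-hdfs-example");
        JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

        // Register clusterB's config files, copied from /etc/hadoop/conf
        // on clusterB. Note: addResource() takes a Path, not a String.
        javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
        javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));

        // Fully qualified hdfs://clusterB/... paths are now resolvable.
        javaSparkContext.textFile("hdfs://clusterB/some/input/path")
                .take(5)
                .forEach(System.out::println);

        javaSparkContext.close();
    }
}
```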

Pay attention to these points:

  • we specify both core-site.xml and hdfs-site.xml, not just one of them
  • we pass a Path object to the addResource() method, not an ordinary String!
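The Path-vs-String distinction matters because Hadoop’s Configuration.addResource(String) treats its argument as a classpath resource name, while addResource(Path) reads a file from the local filesystem directly. A small sketch of the difference (the file name is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AddResourceDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // String overload: "core-site-clusterB.xml" is looked up on the
        // CLASSPATH. If the file sits next to your jar rather than inside
        // it, it is silently ignored.
        conf.addResource("core-site-clusterB.xml");

        // Path overload: the local filesystem is examined directly, so a
        // file shipped alongside the application is found.
        conf.addResource(new Path("core-site-clusterB.xml"));
    }
}
```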

Troubleshooting

If you are facing java.net.UnknownHostException: clusterB, try putting the full namenode address of your remote HDFS, including the port (instead of the hdfs/cluster short name), into the --conf of your run command:

  • (Spark 1.x-2.1) spark.yarn.access.namenodes=hdfs://clusterA,hdfs://namenode.fqdn:port
  • (Spark 2.2+) spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://namenode.fqdn:port
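One way to find the namenode’s RPC address (assuming you can run the hdfs CLI on a node of clusterB) is:

```shell
# Prints the namenode RPC address, e.g. namenode.fqdn:8020
hdfs getconf -confKey dfs.namenode.rpc-address
```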

If you are facing an AccessControlException, check this: https://mchesnavsky.tech/sparks-accesscontrolexception-permission-denied/.

If it is still not working for you, or you are facing another error, check the “Troubleshooting” section in my recent article: https://mchesnavsky.tech/java-access-remote-hdfs-from-current-hadoop-cluster.


If you still have any questions, feel free to ask me in the comments under this article or write me at promark33@gmail.com.

If I saved your day, you can support me 🤝