Spark Java access remote HDFS

Suppose we need to work with a different HDFS cluster (clusterB, for instance) from our Spark Java application running on clusterA.

Firstly, you need to add a --conf key to your run command. It depends on your Spark version:

  • (Spark 1.x-2.1) spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB
  • (Spark 2.2+) spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB
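
For example, a spark-submit command for Spark 2.2+ could look roughly like this (the application class, jar name and deploy options here are just placeholders for illustration):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB \
  --class com.example.MyApp \
  my-app.jar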

Secondly, when you create Spark’s Java context, add this:

javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));

You need to go to clusterB, grab core-site.xml and hdfs-site.xml from there (the default location for Cloudera is /etc/hadoop/conf) and put them next to your app running on clusterA.
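
Putting it together, a minimal sketch of the whole setup might look like this (the remote file path and the app name are made up for illustration; the XML file names are the ones you copied from clusterB):

import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteHdfsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("remote-hdfs-example");
        JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

        // register clusterB's client configs (note: Path, not String)
        javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
        javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));

        // now paths on the remote HDFS resolve by clusterB's nameservice name
        JavaRDD<String> lines = javaSparkContext.textFile("hdfs://clusterB/some/remote/file.txt");
        System.out.println("Lines read from clusterB: " + lines.count());

        javaSparkContext.stop();
    }
}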

Pay attention to these points:

  • we specify both core-site.xml and hdfs-site.xml, not just one of them
  • we pass a Path object to the addResource() method, not just an ordinary String!
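
The second point matters because Hadoop’s Configuration treats a String argument as a classpath resource name, while a Path argument is read directly from the local filesystem. So these two calls are not equivalent:

javaSparkContext.hadoopConfiguration().addResource("hdfs-site-clusterB.xml");           // looked up on the classpath
javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml")); // read from the local file system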


If you are facing issues resolving clusterB, try putting the full namenode address of your remote HDFS with its port (instead of the hdfs/cluster short name) into the --conf of your run command:

  • (Spark 1.x-2.1) spark.yarn.access.namenodes=hdfs://clusterA,hdfs://namenode.fqdn:port
  • (Spark 2.2+) spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://namenode.fqdn:port
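
For instance, with a hypothetical NameNode host and the common default RPC port 8020 (check dfs.namenode.rpc-address in clusterB’s hdfs-site.xml for the real value), the option would become:

--conf spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://nn01.clusterb.example.com:8020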

If you are facing an AccessControlException error, check this:

If it’s not working for you, or you are facing another error, check the “Troubleshooting” section in my recent article:

If you still have any questions, feel free to ask me in the comments under this article or write me at

If I saved your day, you can support me 🤝