Spark Java access remote HDFS

Suppose we need to work with a different HDFS (clusterB, for instance) from our Spark Java application running on clusterA. First, add a --conf key to your run command; which key depends on the Spark version. Second, add the required settings when you create Spark’s Java context. You need to go to clusterB and gather core-site.xml and hdfs-site.xml from there (the default location for Cloudera is /etc/hadoop/conf) […]


Java access remote HDFS from current Hadoop cluster

Suppose our Java app runs on Hadoop clusterA and we want to access a remote HDFS that belongs to Hadoop clusterB. Here is how to do it: go to clusterB, gather core-site.xml and hdfs-site.xml from there (the default location for Cloudera is /etc/hadoop/conf), and put them next to your app running on clusterA. […]
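Before wiring the gathered files into the app, it can help to confirm that the copy of core-site.xml really points at clusterB. The sketch below is only a sanity check and uses nothing but the JDK. It relies on the standard layout of Hadoop site files (a configuration element holding property entries with name and value). The class name SiteXmlCheck and the sample value hdfs://clusterB-nameservice are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: not part of Hadoop, just a JDK-only sanity check.
public class SiteXmlCheck {

    /** Reads every name/value pair from a Hadoop *-site.xml stream. */
    public static Map<String, String> readProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        NodeList props = doc.getElementsByTagName("property");
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < props.getLength(); i++) {
            Element property = (Element) props.item(i);
            NodeList names = property.getElementsByTagName("name");
            NodeList values = property.getElementsByTagName("value");
            // Skip malformed entries that lack a name or a value.
            if (names.getLength() > 0 && values.getLength() > 0) {
                result.put(names.item(0).getTextContent().trim(),
                           values.item(0).getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the core-site.xml copied from clusterB (value is made up).
        String sample =
                "<configuration>"
              + "<property><name>fs.defaultFS</name>"
              + "<value>hdfs://clusterB-nameservice</value></property>"
              + "</configuration>";
        Map<String, String> props = readProperties(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        // fs.defaultFS should name clusterB's file system, not clusterA's.
        System.out.println("fs.defaultFS = " + props.get("fs.defaultFS"));
    }
}
```

In a real run you would pass a FileInputStream for the copied file instead of the in-memory sample. If fs.defaultFS still names clusterA, the wrong file was gathered.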


Yarn is not aggregating application logs

First of all, check the NodeManager logs. There are at least two possible problems: the “Log aggregation is not initialized” problem (https://mchesnavsky.tech/log-aggregation-is-not-initialized) and the “HDFS DELEGATION TOKEN can’t be found in cache” problem (https://mchesnavsky.tech/hdfs-delegation-token-cant-be-found-in-cache). Refer to the corresponding article, or leave a comment below if you have a different problem.


Log aggregation is not initialized

You may encounter a Hadoop YARN exception with this message in the NodeManager logs. It can happen after a NodeManager (NM) reboot: the newly launched NM inherits the running application but does not know how to collect logs from it. According to the Hadoop YARN NodeManager source code, instances of log collector classes for each running application are stored […]


HDFS DELEGATION TOKEN can’t be found in cache

This problem can appear in Hadoop’s NodeManager logs. Usually it means that the NodeManager is trying to use an expired or non-renewed HDFS delegation token. For example, you can face this error during the application log aggregation process. The timeline is: your application passes an HDFS delegation token to the NodeManager through the ContainerLaunchContext class, because […]


Resource changed on src filesystem

This can happen when some process overwrites application files in the HDFS application directory while the app is running. An example of the situation: you start app instance_1, which stores its distribution files in the hdfs:///tmp/app folder. After a while you start a second instance, instance_2, which stores its distribution files in the same HDFS […]


Spark’s User Defined Functions in Java

This article answers two questions: how do you change a column in Spark, and how do you modify a column in Spark? In other words, how do you create a user defined function (UDF) and apply it? As an example, let’s look at a UDF that takes a String and returns a String. For Spark […]


Spark failed to connect to the MetaStore Server

The problem: you may encounter MetaStore connection errors when running a Spark script or application. Solutions: if you do not need the MetaStore server, there are two ways to disable it. Note that Spark version >= 2.x is required. The first way is via spark2-submit parameters. The second way is via the SparkConf object. Java example: Scala […]
