Spark’s User Defined Functions in Java

In this article we will answer the questions: how do you change a column in Spark? How do you modify a column in Spark? In other words: how do you create a user-defined function (UDF) and apply it? For example, let’s look at a UDF that takes a String and returns a String. For Spark […]
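As a rough illustration of the String-to-String case, here is a minimal sketch of the kind of logic such a UDF might wrap. The class name `StringUdfSketch`, the method name `normalize`, and the trim-and-lower-case rule are all hypothetical and not taken from the article. The method is kept free of Spark types so it can be run and tested without a Spark session.

```java
import java.util.Locale;

// Hypothetical helper class: the name and the normalization rule are
// illustrative assumptions, not the article's actual UDF.
final class StringUdfSketch {

    private StringUdfSketch() {}

    // The String -> String logic a UDF would wrap.
    // A null column value arrives as a Java null, so it is handled explicitly.
    static String normalize(String value) {
        if (value == null) {
            return null;
        }
        return value.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Hello Spark ")); // hello spark
        System.out.println(normalize(null));             // null
    }
}
```

With Spark on the classpath, a method like this could be registered through `spark.udf().register("normalize", (UDF1<String, String>) StringUdfSketch::normalize, DataTypes.StringType)` and then applied to a column with `functions.callUDF("normalize", col("name"))`. The UDF name and the column name here are placeholders.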

Spark failed to connect to the MetaStore Server

The problem: you may encounter errors like this when running a Spark script or application. Solutions: if you do not need the MetaStore server, there are two ways to disable it. Please note that Spark 2.x or later is required. The first way is via spark2-submit parameters; the second is via the SparkConf object. Java example: Scala […]

Spark custom Parquet OutputCommitter

If you need to write your own OutputCommitter implementation for Spark Parquet output tasks, first create a class that extends org.apache.hadoop.mapreduce.OutputCommitter: Then, regardless of the implementation, you need to register the full class name in the Hadoop (!) configuration of Spark like this: Example for Scala: […]

Spark concurrent write to same HDFS path

The problem: sometimes several Spark tasks need to write data to the same HDFS path. During execution, you may encounter errors: Suppose we have one Spark task that writes to the hdfs://data/test directory. At runtime, Spark creates a temporary directory: hdfs://data/test/_temporary/0. There is […]

ZooKeeper recursive watcher

If you need to set up recursive watchers (a watch on all nodes), ZooKeeper’s standard Watcher class will not help much: it is installed on only one node (or one level down when you call getChildren()), and it is also a one-time event. This means that after each watch trigger, you need to install a new […]

How ZooKeeper ACL works

In this post I will describe the basic principles of how ACLs work in ZooKeeper. An ACL is not set recursively and is not inherited by child nodes. If we have a read-only ACL for /path1/path2 or /path1/path2/path3, then deleting /path1 will fail, regardless of the ACL of /path1. Several ACL records can be set on […]

Apache Atlas – Building & Installing

Let’s say we want to get a working Apache Atlas instance with embedded HBase & Solr on our machine. Note that you need to install JDK 8 before you start. Go to the Apache Atlas GitHub page and download the zip file with the source code of the latest stable release from here: https://github.com/apache/atlas/tags Important notice: DO […]

How to immediately terminate the Spring Boot Yarn container with an error

Imagine that an error or exception occurs while running the Spring Boot Yarn container, and we need to kill the container from within and return an error code. You can use the @OnContainerStart annotation as mentioned in this article: https://mchesnavsky.tech/how-to-set-up-exit-code-on-spring-boot-yarn-container. But if we need to stop the container immediately, we just need to call: – where parameter is […]

KeeperErrorCode = ConnectionLoss for /hbase/hbaseid

IMPORTANT! If you are trying to install Apache Atlas and receive this error, there is a separate article: https://mchesnavsky.tech/apache-atlas-building-installing/ Suppose that we are faced with these exceptions. The first: The second: The third: The hbase-client cannot connect to ZooKeeper. You need to pay attention to the address: If there is a real ZooKeeper instance at […]
