Spark’s User Defined Functions in Java

In this article we will find the answer for questions: How to change the column in Spark? How to modify column in Spark? In other words: how to create a user defined function (UDF) and apply it.

For example, let’s have a look to UDF, that takes a String and returns a String.

For Spark < 2.3:

// for Spark < 2.3
UDF1<String, String> myUdf = param -> {
    String result = param + "hello_from_udf";
    return result;
}; 

Consider the function declaration: UDF1<String, String> partitionKey

The first String is the type of the param parameter. The second String is the return type of result.

For Spark >= 2.3:

// for Spark >= 2.3
UserDefinedFunction myUdf = udf(
        (String param) -> {
            String result = param + "hello_from_udf";
            return result;
        }, DataTypes.StringType
);

After we have declared and described the function, it needs to be registered:

sqlContext.udf().register("myUdfRegisteredName", myUdf, DataTypes.StringType);

Then it can be used by calling callUDF:

Dataset<?> df = sqlContext.parquetFile(taskSource);
df.withColumn("calculatedColumn", callUDF("myUdfRegisteredName", col("sourceColumn")))
        .write()
        .parquet(“/data/tmp”);

Thus, we will apply myUdf to each value of the sourceColumn, and make the calculatedColumn from that values.

Telegram channel

If you still have any questions, feel free to ask me in the comments under this article or write me at promark33@gmail.com.

If I saved your day, you can support me 🤝

Leave a Reply

Your email address will not be published. Required fields are marked *