In this article we will answer the following questions: how do you change a column in Spark? How do you modify a column in Spark? In other words: how do you create a user defined function (UDF) and apply it?
As an example, let's look at a UDF that takes a String and returns a String.
For Spark < 2.3:
// for Spark < 2.3
// import org.apache.spark.sql.api.java.UDF1;
UDF1<String, String> myUdf = param -> {
    String result = param + "hello_from_udf";
    return result;
};
Consider the function declaration: UDF1<String, String> myUdf
The first String is the type of the param parameter; the second String is the return type, i.e. the type of result.
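The same pattern scales to functions with more arguments via UDF2, UDF3, and so on: the last type parameter is always the return type. A minimal sketch for two arguments (the function name here is just illustrative):
// for Spark < 2.3
// import org.apache.spark.sql.api.java.UDF2;
// two String inputs, one String output: UDF2<Arg1, Arg2, Result>
UDF2<String, String, String> myConcatUdf = (left, right) -> left + "_" + right;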
For Spark >= 2.3:
// for Spark >= 2.3
// import static org.apache.spark.sql.functions.udf;
// import org.apache.spark.sql.api.java.UDF1;
// import org.apache.spark.sql.expressions.UserDefinedFunction;
// import org.apache.spark.sql.types.DataTypes;
UserDefinedFunction myUdf = udf(
    // the cast to UDF1 makes sure the matching udf(...) overload is chosen
    (UDF1<String, String>) param -> {
        String result = param + "hello_from_udf";
        return result;
    }, DataTypes.StringType
);
After we have declared the function, it needs to be registered so that it can be referenced by name. For the Spark < 2.3 UDF1 variant:
sqlContext.udf().register("myUdfRegisteredName", myUdf, DataTypes.StringType);
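For the Spark >= 2.3 UserDefinedFunction variant the return type is already part of the function object, so (assuming the register(name, UserDefinedFunction) overload available in recent Spark 2.x releases) it can be registered without repeating the type:
// for Spark >= 2.3: the return type is carried by the UserDefinedFunction itself
sqlContext.udf().register("myUdfRegisteredName", myUdf);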
Then it can be used by calling callUDF:
// import static org.apache.spark.sql.functions.callUDF;
// import static org.apache.spark.sql.functions.col;
Dataset<Row> df = sqlContext.read().parquet(taskSource);
df.withColumn("calculatedColumn", callUDF("myUdfRegisteredName", col("sourceColumn")))
    .write()
    .parquet("/data/tmp");
Thus, myUdf is applied to each value of sourceColumn, and the results become the new calculatedColumn.
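As a side note, with the Spark >= 2.3 UserDefinedFunction the registration step can be skipped when the UDF is only used through the DataFrame API; a minimal sketch that applies the myUdf object from above directly to the column:
// no registration needed: apply the UserDefinedFunction object to the column
df.withColumn("calculatedColumn", myUdf.apply(col("sourceColumn")))
    .write()
    .parquet("/data/tmp");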
If you still have any questions, feel free to ask in the comments under this article or write to me at promark33@gmail.com.