Chapter 11. Writing Drill User-Defined Functions

In the previous chapters, you learned about Drill’s powerful analytic capabilities. Even so, you will sometimes need to transform data in a way that Drill does not support out of the box. In those cases, you can extend Drill by writing your own user-defined functions (UDFs).

Drill supports two types of UDFs: simple and aggregate. A simple UDF accepts a column or expression as input and returns a single result for each row; the result can be a complex type such as an array or map. An aggregate UDF instead accepts all the values for a group, as defined by a GROUP BY or similar clause, and returns a single result for that group. The SUM() function is a good example: it accepts a column or expression, adds up all the values, and returns a single result. When used with a GROUP BY clause, an aggregate UDF computes its result separately for each group of rows.
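To make the distinction concrete, here is a minimal sketch in plain Java. It does not use Drill's UDF API; the class and method names (UdfShapesSketch, simpleUpper, applySimple, aggregateSum) are hypothetical and only illustrate the two shapes: a simple function is called once per value and yields one result per row, whereas an aggregate function consumes every value in a group and yields one result for the group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: plain Java, not Drill's UDF interfaces.
final class UdfShapesSketch {
    private UdfShapesSketch() {}

    // "Simple" shape: one input value in, one result out.
    static String simpleUpper(String value) {
        return value.toUpperCase(Locale.ROOT);
    }

    // A query engine would call the simple function once per row.
    static List<String> applySimple(List<String> column) {
        List<String> results = new ArrayList<>();
        for (String value : column) {
            results.add(simpleUpper(value));
        }
        return results;
    }

    // "Aggregate" shape: all values of one group in, one result out.
    static long aggregateSum(List<Integer> group) {
        long total = 0;
        for (int value : group) {
            total += value;
        }
        return total;
    }
}
```

Applying the simple function to a three-row column produces three results, while applying the aggregate function to the same three rows produces one.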

Use Case: Finding and Filtering Valid Credit Card Numbers

Suppose you are conducting security research and you find a large list of what appear to be credit card numbers. You want to determine whether these are valid credit card numbers and, if so, notify the appropriate banks.

A credit card number is not simply a random sequence of digits. These numbers follow a specific structure and can be validated with a checksum known as the Luhn algorithm. Although Drill ...
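As a point of reference, the Luhn check itself can be sketched in a few lines of plain Java. This is a standalone illustration, not a Drill UDF and not necessarily the implementation this chapter goes on to use; the class name LuhnSketch and the choice to skip spaces and hyphens are assumptions made for the example. Working from the rightmost digit, every second digit is doubled, 9 is subtracted from any doubled value above 9, and the number passes if the total is divisible by 10.

```java
// Illustration only: a standalone Luhn checksum, not a Drill UDF.
final class LuhnSketch {
    private LuhnSketch() {}

    static boolean isValid(String candidate) {
        if (candidate == null) {
            return false;
        }
        int sum = 0;
        int digitCount = 0;
        boolean doubleThisDigit = false; // the rightmost digit is not doubled
        for (int i = candidate.length() - 1; i >= 0; i--) {
            char c = candidate.charAt(i);
            if (c == ' ' || c == '-') {
                continue; // assumption: tolerate common separators
            }
            if (c < '0' || c > '9') {
                return false; // any other character makes the input invalid
            }
            int digit = c - '0';
            if (doubleThisDigit) {
                digit *= 2;
                if (digit > 9) {
                    digit -= 9; // same as adding the two digits of the product
                }
            }
            sum += digit;
            doubleThisDigit = !doubleThisDigit;
            digitCount++;
        }
        // Require at least two digits so empty input does not pass trivially.
        return digitCount >= 2 && sum % 10 == 0;
    }
}
```

Passing the checksum only shows that a number is well formed; it does not show that the number belongs to a real account.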
