Cloudera Enterprise 5.15.x

Installing DataFu

  Warning: DataFu has been in decline for a significant period of time and is now officially deprecated. Cloudera recommends that you replace DataFu UDFs with Hive UDFs, which provide most of DataFu's functions plus many additional ones, and which are more stable and better supported. In an upcoming release, Apache Pig will support Hive UDFs. For more information about using Hive UDFs in CDH, see Managing UDFs.

DataFu is a collection of Apache Pig UDFs (User-Defined Functions) for statistical evaluation. The UDFs were developed by LinkedIn and are now open source under the Apache 2.0 license.

A number of usage examples and other information are available at https://github.com/linkedin/datafu.

To Use DataFu in a Parcel-deployed Cluster

If your cluster uses parcels, DataFu is already installed. Before using it, register the JAR file with the following command:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/datafu.jar

To Use DataFu in a Package-deployed Cluster

  1. Install the DataFu package:

    Operating system      Install command
    Red-Hat-compatible    sudo yum install pig-udf-datafu
    SLES                  sudo zypper install pig-udf-datafu
    Debian or Ubuntu      sudo apt-get install pig-udf-datafu

    This puts the DataFu JAR file (for example, datafu-0.0.4-cdh5.0.0.jar) in /usr/lib/pig.

  2. Register the JAR. Replace <DataFu_version> and <CDH_version> with the DataFu and CDH version numbers of the installed JAR file.

    REGISTER /usr/lib/pig/datafu-<DataFu_version>-cdh<CDH_version>.jar

    For example:

    REGISTER /usr/lib/pig/datafu-0.0.4-cdh5.0.0.jar
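
If you are not sure which version numbers to use in step 2, you can look up the installed JAR file instead of typing them by hand. The following is a minimal sketch, not part of the DataFu package: datafu_register_line is a hypothetical helper name, and the sketch assumes the JAR file follows the datafu-<DataFu_version>-cdh<CDH_version>.jar naming pattern and is located in /usr/lib/pig unless another directory is given.

```shell
# Hypothetical helper: print a Pig REGISTER statement for the DataFu JAR
# found in a directory (default: /usr/lib/pig, the package install location).
# Assumes the JAR is named datafu-<DataFu_version>-cdh<CDH_version>.jar.
datafu_register_line() {
    dir="${1:-/usr/lib/pig}"
    # If the glob matches nothing, the shell leaves the pattern unexpanded,
    # so each candidate is checked with -f before it is used.
    for jar in "$dir"/datafu-*-cdh*.jar; do
        if [ -f "$jar" ]; then
            # Only the first match (in glob sort order) is reported.
            printf 'REGISTER %s\n' "$jar"
            return 0
        fi
    done
    echo "No DataFu JAR found in $dir" >&2
    return 1
}
```

Calling datafu_register_line with no argument checks /usr/lib/pig; pass a directory as the first argument to check a different location. If more than one DataFu JAR file is present, only the first match is printed, so verify that it is the version you intend to register.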

Page generated May 18, 2018.