HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools - Pig, MapReduce, and Hive - to more easily read and write data on the grid. HCatalog's table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS).
Read the HCatalog Documentation to learn more about HCatalog. Working with HCatalog from Pig is detailed in HCatLoader and HCatStorer. Working with HCatalog directly from MapReduce is detailed in HCatInputFormat and HCatOutputFormat.
HCatalog publishes a notification through a JMS provider such as ActiveMQ when a new partition is added to a table in the database. Applications can consume these events and schedule the work that depends on them. Oozie uses the notifications to determine the availability of the HCatalog partitions defined as data dependencies in a Coordinator, and to trigger the corresponding workflows.
Read HCatalog Notification to know more about notifications in HCatalog.
Until now, Oozie Coordinators have supported HDFS directories as input data dependencies. When an HDFS URI template is specified as a dataset and input events are defined in the Coordinator for that dataset, Oozie checks data availability by polling the HDFS directory URIs resolved from the nominal time. When all the data dependencies are met, the Coordinator's workflow is triggered, and it then consumes the available HDFS data.
With the addition of HCatalog support, Coordinators can also specify a set of HCatalog table partitions as a dataset. The workflow is triggered when the HCatalog table partitions are available, and the workflow actions can then read the partition data. A mix of HDFS and HCatalog dependencies can be specified as input data dependencies. As with HDFS directories, HCatalog table partitions can also be specified as output dataset events.
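As an illustration of mixing both kinds of dependency, the following is a minimal sketch of a Coordinator that waits on one HDFS directory and one HCatalog partition and declares an HCatalog partition as its output. The application name, host names, paths, the `clicks` and `summary` datasets and the end date are hypothetical placeholders; only the `logs` dataset follows the URI example used elsewhere in this document. Check the element layout against the Coordinator schema version deployed on the cluster.

```xml
<coordinator-app name="hcat-logs-coord" frequency="${coord:days(1)}"
                 start="2009-02-15T08:15Z" end="2009-03-15T08:15Z"
                 timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- HDFS directory dataset (hypothetical path) -->
    <dataset name="clicks" frequency="${coord:days(1)}"
             initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
      <uri-template>hdfs://namenode:8020/data/clicks/${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
    <!-- HCatalog partition dataset used as an input dependency -->
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
      <uri-template>hcat://myhcatmetastore:9080/database1/table1/datestamp=${YEAR}${MONTH}${DAY};region=USA</uri-template>
    </dataset>
    <!-- HCatalog partition dataset written by the workflow (hypothetical table) -->
    <dataset name="summary" frequency="${coord:days(1)}"
             initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
      <uri-template>hcat://myhcatmetastore:9080/database1/summary/datestamp=${YEAR}${MONTH}${DAY};region=USA</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- The action runs only when both dependencies are available -->
    <data-in name="clicksIn" dataset="clicks">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="logsIn" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <output-events>
    <data-out name="summaryOut" dataset="summary">
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <!-- Hypothetical workflow application path -->
      <app-path>hdfs://namenode:8020/apps/logs-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

In this sketch the workflow for a given nominal day starts only after both the `clicks` directory and the `logs` partition for that day exist.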
With HDFS data dependencies, Oozie has to poll HDFS repeatedly to determine the availability of a directory. If the HCatalog server is configured to publish partition availability notifications to a JMS provider, Oozie can be configured to subscribe to them and trigger jobs as soon as a partition is announced. This publish-subscribe model reduces pressure on the NameNode and also cuts down on delays caused by polling intervals.
In the absence of a message bus in the deployment, Oozie always polls the HCatalog server directly for partition availability, at the same frequency as HDFS polling. Even when subscribed to notifications, Oozie falls back to polling the HCatalog server, both for partitions that were already available before the coordinator action was materialized and to recover from notifications missed during system downtime. Fallback polling usually runs less often than constant polling; the defaults are 10 minutes for fallback polling and 1 minute for constant polling.
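The two intervals are set on the Oozie server. The fragment below is a sketch of how they might appear in oozie-site.xml. The property names are recalled from oozie-default.xml and are an assumption to verify against the installed Oozie version; the values are simply the defaults stated above, expressed in milliseconds.

```xml
<!-- Sketch only: confirm both property names in oozie-default.xml before use -->
<configuration>
  <!-- Constant polling interval for input checks: 1 minute -->
  <property>
    <name>oozie.service.coord.input.check.requeue.interval</name>
    <value>60000</value>
  </property>
  <!-- Fallback polling interval for notification-based dependencies: 10 minutes -->
  <property>
    <name>oozie.service.coord.push.check.requeue.interval</name>
    <value>600000</value>
  </property>
</configuration>
```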
Refer to the HCatalog Configuration section of the Oozie Install documentation for the Oozie server-side configuration required to support HCatalog table partitions as a data dependency.
Oozie supports specifying HCatalog partitions as a data dependency through a URI notation. The HCatalog partition URI is used to identify a set of table partitions: hcat://bar:8020/logsDB/logsTable/dt=20090415;region=US.
The format for specifying an HCatalog table partition URI is hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value];...
For example,
<dataset name="logs" frequency="${coord:days(1)}"
         initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
  <uri-template>
    hcat://myhcatmetastore:9080/database1/table1/datestamp=${YEAR}${MONTH}${DAY}${HOUR};region=USA
  </uri-template>
</dataset>
A workflow action that interacts with HCatalog requires the following jars in the classpath: hcatalog-core.jar, hcatalog-pig-adapter.jar, webhcat-java-client.jar, hive-common.jar, hive-exec.jar, hive-metastore.jar, hive-serde.jar and libfb303.jar. The hive-site.xml file, which holds the configuration needed to talk to the HCatalog server, also needs to be in the classpath. The versions of the HCatalog and Hive jars placed in the classpath must match the version of HCatalog installed on the cluster.
The jars can be added to the classpath of the action using one of the below ways.
Refer to Coordinator Functional Specification for more information about
Refer to Workflow Functional Specification for more information about