Critical jobs can have SLA requirements associated with them. Such an SLA can be expressed in terms of time, i.e. limits on when the job should start, by when it should end, and how long it may run. Oozie workflows and coordinators allow these SLA limits to be defined in the application definition XML.
With the addition of SLA Monitoring, Oozie can now actively monitor the state of these SLA-sensitive jobs and send out notifications when SLAs are met or missed.
In versions earlier than 4.x, this was a passive feature: users needed to query the Oozie client SLA API to fetch the records of job status changes, and use their own custom calculation engine to compute whether the SLA was met or missed, based on the initially defined time limits.
Oozie now also has an SLA tab in the Oozie UI, where users can query for SLA information and get a summarized view of how their jobs fared against their SLAs.
Refer to Notifications Configuration for configuring the Oozie server to track SLA for jobs and send notifications.
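Oozie allows tracking SLA for meeting the following three criteria: start time, end time, and job duration.
For each of these three criteria, a job is evaluated as Met or Miss, i.e. START_MET or START_MISS, END_MET or END_MISS, and DURATION_MET or DURATION_MISS.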
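Expected end-time is the most important criterion for the majority of users when deciding the overall SLA Met or Miss. Hence the "SLA Status" of a job transitions through these four stages: NOT_STARTED (the job has not yet begun), IN_PROCESS (the job has started and is running, and its SLAs are being tracked), MET (the job ended by its expected end time), and MISS (the job did not end by its expected end time).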
In addition to overshooting the expected end-time, an END_MISS (and so an eventual SLA MISS) also occurs when the job does not end successfully, e.g. when it goes to an error state: Failed/Killed/Error/Timedout.
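The rules described above can be illustrated with a small sketch. This is not Oozie's implementation: the function names are made up for illustration, the stage names are taken as NOT_STARTED, IN_PROCESS, MET and MISS, and treating a still-unfinished job as MISS once the expected end time has passed is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional

# Terminal job states that count as an unsuccessful end (and so an END_MISS).
ERROR_STATES = {"FAILED", "KILLED", "ERROR", "TIMEDOUT"}


def duration_event(expected_minutes: float,
                   actual_start: datetime,
                   actual_end: datetime) -> str:
    """Compare the actual run length against the expected duration."""
    actual_minutes = (actual_end - actual_start).total_seconds() / 60
    return "DURATION_MET" if actual_minutes <= expected_minutes else "DURATION_MISS"


def sla_status(expected_end: datetime,
               actual_start: Optional[datetime],
               actual_end: Optional[datetime],
               job_status: str,
               now: datetime) -> str:
    """Classify a job's overall SLA status (illustrative only)."""
    if job_status.upper() in ERROR_STATES:
        return "MISS"  # an unsuccessful end is always a miss
    if actual_end is not None:
        return "MET" if actual_end <= expected_end else "MISS"
    if now > expected_end:
        return "MISS"  # assumption: deadline passed and the job has not ended
    return "IN_PROCESS" if actual_start is not None else "NOT_STARTED"


def at(hour: int, minute: int) -> datetime:
    """Helper for readable sample timestamps on a made-up date."""
    return datetime(2013, 6, 22, hour, minute, tzinfo=timezone.utc)


# A job that ends before its expected end time but runs longer than expected.
print(sla_status(at(5, 40), at(5, 5), at(5, 30), "SUCCEEDED", at(6, 0)))  # MET
print(duration_event(15, at(5, 5), at(5, 30)))                            # DURATION_MISS
```

The overall status depends only on the end time and the final job state, which is why a job can be MET overall while still recording a duration miss.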
To make your jobs trackable for SLA, you simply need to add the <sla:info> tag to your application definition XML.
Example:
<workflow-app name="test-wf-job-sla"
              xmlns="uri:oozie:workflow:0.5"
              xmlns:sla="uri:oozie:sla:0.2">
    <start to="grouper"/>
    <action name="grouper">
        <map-reduce>
            <job-tracker>jt</job-tracker>
            <name-node>nn</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>input</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>output</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="end"/>
    </action>
    <end name="end"/>
    <sla:info>
        <sla:nominal-time>${nominal_time}</sla:nominal-time>
        <sla:should-start>${10 * MINUTES}</sla:should-start>
        <sla:should-end>${30 * MINUTES}</sla:should-end>
        <sla:max-duration>${30 * MINUTES}</sla:max-duration>
        <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
        <sla:alert-contact>joe@example.com</sla:alert-contact>
    </sla:info>
</workflow-app>
For the list of tags usable under <sla:info>, refer to the SLA schema (uri:oozie:sla:0.2).
NOTE: All tags can be parameterized.
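To illustrate how these time limits relate to each other, the sketch below derives an expected start and end time by adding the should-start and should-end offsets to the nominal time. It assumes both offsets are relative to the nominal time; the helper name expected_times and the sample nominal time are made up for illustration and are not part of Oozie.

```python
from datetime import datetime, timedelta, timezone


def expected_times(nominal_time: datetime,
                   should_start_min: int,
                   should_end_min: int):
    """Return (expected_start, expected_end) as offsets from the nominal time."""
    expected_start = nominal_time + timedelta(minutes=should_start_min)
    expected_end = nominal_time + timedelta(minutes=should_end_min)
    return expected_start, expected_end


# Hypothetical nominal time; offsets match the example above (10 and 30 minutes).
nominal = datetime(2013, 6, 22, 5, 0, tzinfo=timezone.utc)
start_by, end_by = expected_times(nominal, 10, 30)
print(start_by.strftime("%Y-%m-%dT%H:%MZ"))  # 2013-06-22T05:10Z
print(end_by.strftime("%Y-%m-%dT%H:%MZ"))    # 2013-06-22T05:30Z
```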
The same schema can be applied to and embedded under a workflow action as well as a coordinator action XML.
<workflow-app name="test-wf-action-sla"
              xmlns="uri:oozie:workflow:0.5"
              xmlns:sla="uri:oozie:sla:0.2">
    <start to="grouper"/>
    <action name="grouper">
        ...
        <ok to="end"/>
        <error to="end"/>
        <sla:info>
            <sla:nominal-time>${nominal_time}</sla:nominal-time>
            <sla:should-start>${10 * MINUTES}</sla:should-start>
            ...
        </sla:info>
    </action>
    <end name="end"/>
</workflow-app>
<coordinator-app name="test-coord-sla"
                 frequency="${coord:days(1)}" freq_timeunit="DAY" end_of_duration="NONE"
                 start="2013-06-20T08:01Z" end="2013-12-01T08:01Z"
                 timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.4"
                 xmlns:sla="uri:oozie:sla:0.2">
    <action>
        <workflow>
            <app-path>${wfAppPath}</app-path>
        </workflow>
        <sla:info>
            <sla:nominal-time>${nominal_time}</sla:nominal-time>
            ...
        </sla:info>
    </action>
</coordinator-app>
SLA information is accessible via the following two ways
For JMS notifications, you need a message broker in place, to which Oozie publishes messages and to which you can attach a subscriber to receive them. For more information on setting up and consuming JMS messages, refer to the JMS Notifications documentation.
In the REST API, the following filters can be applied while fetching SLA information:
When the timezone query parameter is specified, the expected and actual start/end times are returned as formatted date strings. If it is not specified, they are returned as the number of milliseconds elapsed since January 1, 1970 00:00:00.000 GMT.
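When the timezone parameter is omitted, a client has to convert the millisecond values itself. The following is a minimal sketch of such a conversion; the helper name format_epoch_millis is hypothetical, and the output pattern simply mimics the timestamps used in this document rather than any format guaranteed by Oozie.

```python
from datetime import datetime, timezone


def format_epoch_millis(millis: int) -> str:
    """Render milliseconds since the epoch as a UTC timestamp string."""
    moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%MZ")


print(format_epoch_millis(0))  # 1970-01-01T00:00Z
```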
The examples below demonstrate the use of the REST API and explain the JSON response.
Request:
GET <oozie-host>:<port>/oozie/v2/sla?timezone=GMT&filter=nominal_start=2013-06-18T00:01Z;nominal_end=2013-06-23T00:01Z;app_name=my-sla-app
JSON Response
{
  id: "000056-1238791320234-oozie-joe-W"
  parentId: "000001-1238791320234-oozie-joe-C@8"
  appType: "WORKFLOW_JOB"
  msgType: "SLA"
  appName: "my-sla-app"
  slaStatus: "IN_PROCESS"
  eventStatus: "START_MISS"
  user: "joe"
  nominalTime: "2013-06-22T05:00Z"
  expectedStartTime: "2013-06-22T05:10Z"   <-- (should start by this time)
  actualStartTime: "2013-06-22T05:30Z"     <-- (20 min late relative to expected start)
  expectedEndTime: "2013-06-22T05:40Z"     <-- (should end by this time)
  actualEndTime: null
  expectedDuration: 15
  actualDuration: null
  notificationMessage: "My Job has encountered an SLA event!"
  upstreamApps: "dependent-app-1, dependent-app-2"
}
Request:
GET <oozie-host>:<port>/oozie/v2/sla?timezone=GMT&filter=parent_id=000056-1238791320234-oozie-joe-W
JSON Response
{
  id: "000056-1238791320234-oozie-joe-W@map-reduce-action"
  parentId: "000056-1238791320234-oozie-joe-W"
  appType: "WORKFLOW_ACTION"
  msgType: "SLA"
  appName: "map-reduce-action"
  slaStatus: "MISS"
  eventStatus: "END_MISS"
  user: "joe"
  nominalTime: "2013-06-22T05:00Z"
  expectedStartTime: "2013-06-22T05:10Z"
  actualStartTime: "2013-06-22T05:05Z"
  expectedEndTime: "2013-06-22T05:40Z"     <-- (should end by this time)
  actualEndTime: "2013-06-22T06:00Z"       <-- (20 min late relative to expected end)
  expectedDuration: 60
  actualDuration: 55
  notificationMessage: "My Job has encountered an SLA event!"
  upstreamApps: "dependent-app-1, dependent-app-2"
}
Request:
GET <oozie-host>:<port>/oozie/v2/sla?timezone=GMT&filter=id=000001-1238791320234-oozie-joe-C
JSON Response
{
  id: "000001-1238791320234-oozie-joe-C@2"
  parentId: "000001-1238791320234-oozie-joe-C"
  appType: "COORDINATOR_ACTION"
  msgType: "SLA"
  appName: "my-coord-app"
  slaStatus: "MET"
  eventStatus: "DURATION_MISS"
  user: "joe"
  nominalTime: "2013-06-22T05:00Z"
  expectedStartTime: "2013-06-22T05:10Z"
  actualStartTime: "2013-06-22T05:05Z"
  expectedEndTime: "2013-06-22T05:40Z"
  actualEndTime: "2013-06-22T05:30Z"
  expectedDuration: 15                     <-- (expected duration in minutes)
  actualDuration: 25
  notificationMessage: "My Job has encountered an SLA event!"
  upstreamApps: "dependent-app-1, dependent-app-2"
}
The third scenario (the coordinator action above) is particularly interesting: its overall status is "MET" because it met its expected end time, but its event status is "DURATION_MISS" because the actual run (between actual start and actual end) exceeded the expected duration.
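The request URLs used in the scenarios above can also be assembled programmatically. The following is a minimal sketch; the helper name sla_query_url and the server address http://localhost:11000/oozie are hypothetical, while the /v2/sla path, the timezone parameter and the semicolon-separated filter syntax are taken from the requests shown in this document.

```python
from urllib.parse import urlencode


def sla_query_url(base_url: str, filters: dict, tz: str = "") -> str:
    """Build a /v2/sla request URL from filter name/value pairs."""
    query = {}
    if tz:
        query["timezone"] = tz
    # Filter clauses are joined with ';' as in the requests shown above.
    query["filter"] = ";".join(f"{name}={value}" for name, value in filters.items())
    # Keep '=', ';' and ':' literal so the URL matches the form used in this document.
    return f"{base_url}/v2/sla?{urlencode(query, safe='=;:')}"


url = sla_query_url(
    "http://localhost:11000/oozie",  # hypothetical server address
    {
        "nominal_start": "2013-06-18T00:01Z",
        "nominal_end": "2013-06-23T00:01Z",
        "app_name": "my-sla-app",
    },
    tz="GMT",
)
print(url)
```

Any character outside the ones explicitly kept literal is percent-encoded by urlencode, so application names containing spaces or other special characters are still transmitted safely.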
Subject: OOZIE - SLA END_MISS (AppName=wf-sla-job, JobID=0000004-130610225200680-oozie-oozi-W)

Status:
  SLA Status - END_MISS
  Job Status - RUNNING
  Notification Message - Missed SLA for Data Pipeline job

Job Details:
  App Name - wf-sla-job
  App Type - WORKFLOW_JOB
  User - strat_ci
  Job ID - 0000004-130610225200680-oozie-oozi-W
  Job URL - http://host.domain.com:4080/oozie//?job=0000004-130610225200680-oozie-oozi-W
  Parent Job ID - N/A
  Parent Job URL - N/A
  Upstream Apps - wf-sla-up-app

SLA Details:
  Nominal Time - Mon Jun 10 23:33:00 UTC 2013
  Expected Start Time - Mon Jun 10 23:35:00 UTC 2013
  Actual Start Time - Mon Jun 10 23:34:04 UTC 2013
  Expected End Time - Mon Jun 10 23:38:00 UTC 2013
  Expected Duration (in mins) - 300000
  Actual Duration (in mins) - -1
There are two known issues when you define SLA for a workflow action.