Oozie Workflows for Distributed Tasks: Email Notifications

In today's big data world, Spark, Hadoop, and other frameworks keep appearing, and the flood of high-end computing frameworks and distributed jobs can be dizzying. Have you run into this problem: you have plenty of distributed jobs that must run at a fixed time every day, but a hand-rolled scheduler is neither stable nor able to send reliable notifications?

If you want to learn the basics of Oozie first, refer to the earlier introductory post.

Then what you are looking for is Oozie.

Oozie is an open-source framework for scheduling distributed tasks. It supports many kinds of jobs, such as MapReduce, Spark, Sqoop, Pig, and even shell scripts. You can schedule them in various ways and compose them into workflows, and the nodes of a workflow can run either serially or in parallel.

Once you have defined a series of tasks, you can start the workflow and set up a coordinator to run it on a fixed schedule.
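To make that concrete, here is a minimal coordinator sketch (not from the original post; the application path, names, time range, and timezone are placeholder values you would replace with your own) that triggers a workflow once a day:

<coordinator-app name="daily-wf-coord" frequency="${coord:days(1)}"
                 start="2016-09-01T01:00+0800" end="2099-01-01T01:00+0800"
                 timezone="GMT+0800" xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <!-- HDFS directory containing the workflow.xml to schedule (placeholder path) -->
            <app-path>${nameNode}/user/oozie/apps/daily-wf</app-path>
        </workflow>
    </action>
</coordinator-app>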

With all of that in place, one important piece is still missing: email notifications. Whether a job succeeds or fails, Oozie can send you an email, so when the success message arrives each night you can go to sleep with peace of mind.

So this post walks through how to use email in Oozie.

Email Action

In Oozie, every step of a workflow is modeled as an action, and email is one of those actions.

The email action sends mail from within Oozie. It must specify the recipient addresses, a subject, and a body. The recipient parameter accepts multiple mailbox addresses separated by commas.

The email action executes synchronously: the action is not considered complete, and the next action does not start, until the email has been sent.

All parameters of the email action may use EL expressions.

Syntax

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>[COMMA-SEPARATED-TO-ADDRESSES]</to>
            <cc>[COMMA-SEPARATED-CC-ADDRESSES]</cc> <!-- cc is optional -->
            <subject>[SUBJECT]</subject>
            <body>[BODY]</body>
            <content_type>[CONTENT-TYPE]</content_type> <!-- content_type is optional -->
            <attachment>[COMMA-SEPARATED-HDFS-FILE-PATHS]</attachment> <!-- attachment is optional -->
        </email>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The to and cc elements specify who receives the email; multiple addresses can be listed, separated by commas. to is required, while cc is optional.

The subject and body elements specify the email's title and content. Starting with email-action:0.2 the body can be sent as text/html; the default is plain text ("text/plain").

The attachment element attaches a file from HDFS to the email; multiple attachments can be specified, separated by commas. A path that is not fully qualified is still treated as a path on HDFS, and local files cannot be attached.
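As a hedged illustration of the optional elements above (the addresses, subject, HDFS attachment path, and transition node names are all made-up placeholders, not from the original post):

<action name="html-report-email">
    <email xmlns="uri:oozie:email-action:0.2">
        <to>team@example.com,lead@example.com</to>
        <cc>manager@example.com</cc>
        <subject>Daily report for ${wf:id()}</subject>
        <body>&lt;h3&gt;Report ready&lt;/h3&gt;&lt;p&gt;Workflow ${wf:id()} finished, see the attached summary.&lt;/p&gt;</body>
        <content_type>text/html</content_type>
        <!-- placeholder HDFS path; remember that local files cannot be attached -->
        <attachment>/user/xingoo/reports/summary.csv</attachment>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>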

Configuration

The email action needs SMTP server settings configured in oozie-site.xml. The values to configure are listed below; a sample snippet is given after the PS note at the end of this list.

oozie.email.smtp.host

The address of the SMTP server; the default is localhost.

oozie.email.smtp.port

The port of the SMTP server; the default is 25.

oozie.email.from.address

The from-address used when sending mail; the default is oozie@localhost.

oozie.email.smtp.auth

Whether to enable SMTP authentication; disabled by default.

oozie.email.smtp.username

The login username when authentication is enabled; empty by default.

oozie.email.smtp.password

The password for that user when authentication is enabled; empty by default.

PS: on Linux you can search the current directory tree for the file with find -name oozie-site.xml. In our CDH distribution it lives at ./etc/oozie/conf.dist/oozie-site.xml.
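Putting the six properties together, a hedged oozie-site.xml snippet might look like the following; the property blocks go inside the configuration element, and the host, from-address, and credentials are placeholders for your own SMTP server:

<property>
    <name>oozie.email.smtp.host</name>
    <value>smtp.example.com</value>
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>25</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>oozie@example.com</value>
</property>
<property>
    <name>oozie.email.smtp.auth</name>
    <value>true</value>
</property>
<property>
    <name>oozie.email.smtp.username</name>
    <value>oozie@example.com</value>
</property>
<property>
    <name>oozie.email.smtp.password</name>
    <value>change-me</value>
</property>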

Problems Encountered

Many people run into mail that simply never gets sent. First make sure an SMTP service is actually running; you can check with telnet localhost 25.

Also, if you use a corporate mail server, pay attention to the sender format: it must match the corporate mailbox settings, and recipients can only be addresses within the corporate mail domain.

In Cloudera Manager, the corresponding settings appear on the Oozie service's configuration page (the original post shows a screenshot of that page).

Example

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="an-email">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>bob@initech.com,the.other.bob@initech.com</to>
            <cc>will@initech.com</cc>
            <subject>Email notifications for ${wf:id()}</subject>
            <body>The wf ${wf:id()} successfully completed.</body>
        </email>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

In the example above, the mail goes to bob and the.other.bob with a cc to will, and the subject and body are set explicitly, both embedding the workflow id. A sketch of the complementary failure notification follows.
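The sample only notifies on success. As a rough sketch (node names and the address are hypothetical, not from the original post), failure notification is typically wired by pointing a job's error transition at an email action that then leads to the kill node:

<action name="spark-etl">
    <!-- ... the actual Spark/Sqoop/shell job definition goes here ... -->
    <ok to="an-email"/>
    <error to="fail-email"/>
</action>

<action name="fail-email">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>oncall@example.com</to>
        <subject>FAILED: ${wf:id()}</subject>
        <body>Workflow ${wf:id()} failed at node ${wf:lastErrorNode()}.</body>
    </email>
    <ok to="killAction"/>
    <error to="killAction"/>
</action>

<kill name="killAction">
    <message>Job failed, error message: ${wf:errorMessage(wf:lastErrorNode())}</message>
</kill>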

Appendix

To give a fuller picture of Oozie, the relevant Oozie configuration is reproduced here for reference.

oozie-site.xml configuration

<?xml version="1.0"?>
<configuration>
    <!-- oozie-default.xml holds the default configuration -->
    <property>
        <name>oozie.service.ProxyUserService.proxyuser.hue.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
        <value>*</value>
    </property>
</configuration>

oozie-default.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<configuration>

    <!-- ************************** VERY IMPORTANT  ************************** -->
    <!-- This file is in the Oozie configuration directory only for reference. -->
    <!-- It is not loaded by Oozie, Oozie uses its own private copy.           -->
    <!-- ************************** VERY IMPORTANT  ************************** -->

    <property>
        <name>oozie.output.compression.codec</name>
        <value>gz</value>
        <description>
            The name of the compression codec to use. The codec class must implement the
            interface org.apache.oozie.compression.CompressionCodec. If oozie.compression.codecs
            is not specified, the gz codec implementation is used by default.
        </description>
    </property>

    <property>
        <name>oozie.action.mapreduce.uber.jar.enable</name>
        <value>false</value>
        <description>
            If true, enables the oozie.mapreduce.uber.jar workflow configuration property, which is used to
            specify an uber jar in HDFS. If false, workflows which specify the oozie.mapreduce.uber.jar
            configuration property will fail.
        </description>
    </property>

    <property>
        <name>oozie.processing.timezone</name>
        <value>UTC</value>
        <description>
            Oozie server timezone. Valid values are UTC and GMT(+/-)####, for example 'GMT+0530' would be India
            timezone. All dates parsed and generated by Oozie Coordinator/Bundle will be in the specified timezone.
            The default value of 'UTC' should not be changed under normal circumstances. If for any reason it
            is changed, note that GMT(+/-)#### timezones do not observe DST changes.
        </description>
    </property>

    <!-- Base Oozie URL: <SCHEME>://<HOST>:<PORT>/<CONTEXT> -->

    <property>
        <name>oozie.base.url</name>
        <value>http://localhost:8080/oozie</value>
        <description>
             Base Oozie URL.
        </description>
    </property>

    <!-- Services -->

    <property>
        <name>oozie.system.id</name>
        <value>oozie-${user.name}</value>
        <description>
            The Oozie system ID.
        </description>
    </property>

    <property>
        <name>oozie.systemmode</name>
        <value>NORMAL</value>
        <description>
            System mode for  Oozie at startup.
        </description>
    </property>

    <property>
        <name>oozie.delete.runtime.dir.on.shutdown</name>
        <value>true</value>
        <description>
            If the runtime directory should be kept after Oozie shutdowns down.
        </description>
    </property>

    <property>
        <name>oozie.services</name>
        <value>
            org.apache.oozie.service.SchedulerService,
            org.apache.oozie.service.InstrumentationService,
            org.apache.oozie.service.MemoryLocksService,
            org.apache.oozie.service.UUIDService,
            org.apache.oozie.service.ELService,
            org.apache.oozie.service.AuthorizationService,
            org.apache.oozie.service.UserGroupInformationService,
            org.apache.oozie.service.HadoopAccessorService,
            ...
        </value>
        <description>
            All services to be created and managed by Oozie Services singleton.
            Class names must be separated by commas.
        </description>
    </property>

    <property>
        <name>oozie.service.JPAService.pool.max.active.conn</name>
        <value>10</value>
        <description>
             Max number of connections.
        </description>
    </property>

   <!-- SchemaService -->

    <property>
        <name>oozie.service.SchemaService.wf.schemas</name>
        <value>
            oozie-workflow-0.1.xsd,oozie-workflow-0.2.xsd,oozie-workflow-0.2.5.xsd,oozie-workflow-0.3.xsd,oozie-workflow-0.4.xsd,
            oozie-workflow-0.4.5.xsd,oozie-workflow-0.5.xsd,
            shell-action-0.1.xsd,shell-action-0.2.xsd,shell-action-0.3.xsd,
            email-action-0.1.xsd,email-action-0.2.xsd,
            hive-action-0.2.xsd,hive-action-0.3.xsd,hive-action-0.4.xsd,hive-action-0.5.xsd,hive-action-0.6.xsd,
            sqoop-action-0.2.xsd,sqoop-action-0.3.xsd,sqoop-action-0.4.xsd,
            ssh-action-0.1.xsd,ssh-action-0.2.xsd,
            distcp-action-0.1.xsd,distcp-action-0.2.xsd,
            oozie-sla-0.1.xsd,oozie-sla-0.2.xsd,
            hive2-action-0.1.xsd, hive2-action-0.2.xsd,
            spark-action-0.1.xsd,spark-action-0.2.xsd
        </value>
        <description>
            List of schemas for workflows (separated by commas).
        </description>
    </property>

    <property>
        <name>oozie.service.SchemaService.wf.ext.schemas</name>
        <value> </value>
        <description>
            List of additional schemas for workflows (separated by commas).
        </description>
    </property>

    <property>
        <name>oozie.service.SchemaService.coord.schemas</name>
        <value> ... </value>
        <description>
            List of schemas for coordinators (separated by commas).
        </description>
    </property>

    <!-- ActionService -->

    <property>
        <name>oozie.service.ActionService.executor.classes</name>
        <value>
            org.apache.oozie.action.decision.DecisionActionExecutor,
            org.apache.oozie.action.hadoop.JavaActionExecutor,
            org.apache.oozie.action.hadoop.FsActionExecutor,
            org.apache.oozie.action.hadoop.MapReduceActionExecutor,
            org.apache.oozie.action.hadoop.PigActionExecutor,
            org.apache.oozie.action.hadoop.HiveActionExecutor,
            org.apache.oozie.action.hadoop.ShellActionExecutor,
            org.apache.oozie.action.hadoop.SqoopActionExecutor,
            org.apache.oozie.action.hadoop.DistcpActionExecutor,
            org.apache.oozie.action.hadoop.Hive2ActionExecutor,
            org.apache.oozie.action.ssh.SshActionExecutor,
            org.apache.oozie.action.oozie.SubWorkflowActionExecutor,
            org.apache.oozie.action.email.EmailActionExecutor,
            org.apache.oozie.action.hadoop.SparkActionExecutor
        </value>
        <description>
            List of ActionExecutors classes (separated by commas).
            Only action types with associated executors can be used in workflows.
        </description>
    </property>

    <property>
        <name>oozie.service.ActionService.executor.ext.classes</name>
        <value> </value>
        <description>
            List of ActionExecutors extension classes (separated by commas). Only action types with associated
            executors can be used in workflows. This property is a convenience property to add extensions to the
            built in executors without having to include all the built in ones.
        </description>
    </property>

    <!-- ActionCheckerService -->

    <property>
        <name>oozie.service.ActionCheckerService.action.check.interval</name>
        <value>60</value>
        <description>
            The frequency at which the ActionCheckService will run.
        </description>
    </property>

    <!-- SparkConfigurationService -->

    <property>
        <name>oozie.service.SparkConfigurationService.spark.configurations</name>
        <value>*=spark-conf</value>
        <description>
            Comma separated AUTHORITY=SPARK_CONF_DIR, where AUTHORITY is the HOST:PORT of
            the ResourceManager of a YARN cluster. The wildcard '*' configuration is
            used when there is no exact match for an authority. The SPARK_CONF_DIR contains
            the relevant spark-defaults.conf properties file. If the path is relative is looked within
            the Oozie configuration directory; though the path can be absolute.  This is only used
            when the Spark master is set to either "yarn-client" or "yarn-cluster".
        </description>
    </property>

    <property>
        <name>oozie.service.SparkConfigurationService.spark.configurations.ignore.spark.yarn.jar</name>
        <value>true</value>
        <description>
            If true, Oozie will ignore the "spark.yarn.jar" property from any Spark configurations specified in
            oozie.service.SparkConfigurationService.spark.configurations.  If false, Oozie will not ignore it.  It is
            recommended to leave this as true because it can interfere with the jars in the Spark sharelib.
        </description>
    </property>

    <property>
        <name>oozie.email.attachment.enabled</name>
        <value>true</value>
        <description>
            This value determines whether to support email attachment of a file on HDFS.
            Set it false if there is any security concern.
        </description>
    </property>

    <property>
        <name>oozie.actions.default.name-node</name>
        <value> </value>
        <description>
            The default value to use for the &lt;name-node&gt; element in applicable action types.  This value will be
            used when neither the action itself nor the global section specifies a &lt;name-node&gt;.  As expected,
            it should be of the form "hdfs://HOST:PORT".
        </description>
    </property>

    <property>
        <name>oozie.actions.default.job-tracker</name>
        <value> </value>
        <description>
            The default value to use for the &lt;job-tracker&gt; element in applicable action types.  This value will be
            used when neither the action itself nor the global section specifies a &lt;job-tracker&gt;.  As expected,
            it should be of the form "HOST:PORT".
        </description>
    </property>

</configuration>

This article is reproduced from xingoo's blog on cnblogs (original post: Oozie分布式任务的工作流——邮件篇). Please contact the original author before reposting.
