Managing kettle job configuration

Over time I've grown a habit of making a configuration file for my kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. That is the easy part. But the job still has to run within the new environment, against different database connections, web service URLs and file system paths.

Variables

In the past, much has been written about using kettle variables, parameters and arguments. Variables are the basic feature that provides the mechanism to configure transformation steps and job entries: instead of using literal configuration values, you use a variable reference. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references - instead I'm focusing on how to manage the configuration once you have already used variable references inside your jobs and transformations.

Managing configuration

To manage the configuration, I typically start the main job with a set-variables.ktr transformation. This transformation reads configuration data from a config.properties file and assigns it to variables, so any subsequent jobs and transformations can access the configuration data through variable references. The main job has one parameter called ${CONFIG_DIR}, which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file.
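For example, when the main job is launched from the command line with Kitchen, the caller can supply that parameter with the -param option. A minimal sketch, where the job file name and the configuration path are hypothetical:

./kitchen.sh -file=main-job.kjb -param:CONFIG_DIR=/path/to/test/config

Each environment then simply gets its own directory containing an appropriate config.properties.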

Reading configuration properties

The config.properties file is just a list of key/value pairs separated by an equals sign. Each key represents a variable name, and each value the value to assign to that variable. The following snippet should give you an idea:

#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
STAGING_USER=staging
STAGING_PASSWORD=$74g!n9

The set-variables.ktr transformation reads this file using a "Property Input" step, which yields a stream of key/value pairs.

Pivoting key/value pairs to use the "set variables" step

In the past, I used to set the variables using the "Set variables" step. This step works by creating a variable from a selected field in the incoming stream and assigning that field's value to it. This means you can't just feed the stream of key/value pairs from the property input step into the set variables step: the stream coming out of the property input step contains multiple rows with just two fields, called "Key" and "Value". Feeding it directly into the "Set variables" step would just create two variables called Key and Value, and they would be assigned values multiple times, once for each key/value pair in the stream. So in order to meaningfully assign variables, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream, using the "Row Denormaliser" step. The relevant settings are:

  • Key field: the value of this field is scanned to determine in which output field to put the corresponding value.
  • Group fields: none. We want all key/value pairs to end up in one big row; put another way, there is just one group comprising all key/value pairs.
  • Target fields: the grid below specifies, for each distinct value of the "Key" field, to which output field name it should be mapped; in all cases, we want the value of the "Value" field to be stored in those fields.
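To make the pivot concrete, here is a sketch of the shape change, using some of the properties from the snippet above:

Before the "Row Denormaliser" step (one row per property):

Key              | Value
-----------------+----------
STAGING_DATABASE | staging
STAGING_HOST     | localhost
STAGING_PORT     | 3351

After (a single row, one field per key):

STAGING_DATABASE | STAGING_HOST | STAGING_PORT
-----------------+--------------+-------------
staging          | localhost    | 3351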

Drawbacks

There are two important drawbacks to this approach:

  • The "Row Normaliser" uses the value of the keys to map the value to a new field. This means that we have to type the name of each and every variable appearing in the config.properties file. So you manually need to keep he config.propeties and the "Denormaliser" synchronized, and in practice it's very easy to make mistakes here.
  • Because the "Row Denormaliser" step literally needs to know all variables, the set-variables.ktr transformation becomes specific to one particular project.

Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the set-variables.ktr transformation has to know all variable names anyway, I was tempted to store the configuration values themselves inside the transformation as well (using a "Generate rows" or "Data grid" step or something like that), and "simply" make a new set-variables.ktr transformation for every environment. Of course, that didn't feel right either.

Solution: JavaScript

As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "Set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new set-variables.ktr transformation simply feeds the stream of key/value pairs from the "Property Input" step straight into a JavaScript step.

The actual variable assignment is done with Kettle's built-in setVariable(key, value, scope) function. The key and value from the incoming stream are passed as the key and value arguments of the setVariable() function. The third argument of the setVariable() function is a string that identifies the scope of the variable, and must have one of the following values:

  • "s" - system-wide
  • "r" - up to the root
  • "p" - up to the parent job of this transormation
  • "g" - up to the grandparent job of this transormation

For my purpose, I settled for "r".
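As a sketch, the entire script inside the JavaScript step can then be as small as this (assuming the incoming fields are named "Key" and "Value", as produced by the "Property Input" step):

// For each incoming row, register the key/value pair as a Kettle variable.
// The "r" scope makes the variable visible all the way up to the root job.
setVariable(Key, Value, "r");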

The bonus is that this set-variables.ktr is less complex than the previous one and is now even completely independent of the content of the configuration. It has become a reusable transformation that you can use over and over.

------------------------------------------------------------------------------------------------------

Some people argue: why not configure these variables in .kettle\kettle.properties (in the user's home directory, or wherever KETTLE_HOME points)? That approach can achieve the same goal, but it is not a good solution: when the Kettle job (.kjb) or transformation (.ktr) files are moved to another environment (dev, test, alpha, beta, prod, ...), you need to update that file, and the same has to be done whenever one or more properties are updated.

See also: http://wiki.pentaho.com/display/EAI/Expose+key-value+pairs+from+a+.properties+file+as+variables
