Greenplum: calling a function on every node via gp_dist_random('gp_id')

In Greenplum, a function call normally executes only on the master; it does not get dispatched to the segments.
Take random() as an example.
When invoked via select random(), the SQL never needs to be sent to the segment nodes, so the plan below contains no Gather Motion step.

postgres=# explain analyze select random();
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Result  (cost=0.01..0.02 rows=1 width=0)
   Rows out:  1 rows with 0.017 ms to end, start offset by 0.056 ms.
   InitPlan
     ->  Result  (cost=0.00..0.01 rows=1 width=0)
           Rows out:  1 rows with 0.004 ms to end of 2 scans, start offset by 0.059 ms.
 Slice statistics:
   (slice0)    Executor memory: 29K bytes.
   (slice1)    Executor memory: 29K bytes.
 Statement statistics:
   Memory used: 128000K bytes
 Total runtime: 0.074 ms
(11 rows)

What if you want the function to execute on the segments?
Call it via gp_dist_random('gp_id'). The argument to gp_dist_random is the name of a queryable table or view.

postgres=# explain analyze select random() from gp_dist_random('gp_id');
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 240:1  (slice1; segments: 240)  (cost=0.00..4.00 rows=240 width=0)
   Rows out:  240 rows at destination with 6.336 ms to first row, 59 ms to end, start offset by 4195 ms.
   ->  Seq Scan on gp_id  (cost=0.00..4.00 rows=1 width=0)
         Rows out:  Avg 1.0 rows x 240 workers.  Max 1 rows (seg0) with 0.073 ms to first row, 0.075 ms to end, start offset by 4207 ms.
 Slice statistics:
   (slice0)    Executor memory: 471K bytes.
   (slice1)    Executor memory: 163K bytes avg x 240 workers, 163K bytes max (seg0).
 Statement statistics:
   Memory used: 128000K bytes
 Total runtime: 4279.445 ms
(10 rows)

gp_id holds exactly one row on every segment, so the SQL above calls random() once per segment and returns all the results. My test environment has 240 segments, so the query returns 240 rows.
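Because gp_id is guaranteed to hold exactly one row per segment, counting its rows through gp_dist_random is a quick way to see how many segments participate in a dispatch (a sketch based on the behavior above, not verified on other versions):

```sql
-- One row of gp_id per segment, so the count equals the number
-- of segments. On the 240-segment cluster above this returns 240.
select count(*) from gp_dist_random('gp_id');
```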

The comments in gp_id's definition explain that gp_dist_random can be used for administrative tasks:
for instance, querying the size of a database or of a table is implemented in exactly this way.
src/backend/catalog/postgres_bki_srcs

/*-------------------------------------------------------------------------
 *
 * gp_id.h
 *        definition of the system "database identifier" relation (gp_dbid)
 *        along with the relation's initial contents.
 *
 * Copyright (c) 2009-2010, Greenplum inc
 *
 * NOTES
 *    Historically this table was used to supply every segment with its
 * identification information.  However in the 4.0 release when the file
 * replication feature was added it could no longer serve this purpose
 * because it became a requirement for all tables to have the same physical
 * contents on both the primary and mirror segments.  To resolve this the
 * information is now passed to each segment on startup based on the
 * gp_segment_configuration (stored on the master only), and each segment
 * has a file in its datadirectory (gp_dbid) that uniquely identifies the
 * segment.
 *
 *   The contents of the table are now irrelevant, with the exception that
 * several tools began relying on this table for use as a method of remote
 * function invocation via gp_dist_random('gp_id') due to the fact that this
 * table was guaranteed of having exactly one row on every segment.  The
 * contents of the row have no defined meaning, but this property is still
 * relied upon.
 */
#ifndef _GP_ID_H_
#define _GP_ID_H_  

#include "catalog/genbki.h"
/*
 * Defines for gp_id table
 */
#define GpIdRelationName                        "gp_id"  

/* TIDYCAT_BEGINFAKEDEF  

   CREATE TABLE gp_id
   with (shared=true, oid=false, relid=5001, content=SEGMENT_LOCAL)
   (
   gpname       name     ,
   numsegments  smallint ,
   dbid         smallint ,
   content      smallint
   );  

   TIDYCAT_ENDFAKEDEF
*/
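Following the same pattern that the header comment describes, per-segment object sizes can be summed across the cluster. This is a hedged sketch: the table name mytable is hypothetical, and it assumes that pg_relation_size, when executed on a segment, reports only that segment's local size:

```sql
-- Hypothetical table 'mytable'. Dispatched to every segment via
-- gp_dist_random('gp_id'); each segment reports its local relation
-- size, and the sum is the cluster-wide total.
select sum(pg_relation_size('mytable'))::int8
from gp_dist_random('gp_id');
```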

The GP function for querying database size:

postgres=# \df+ pg_database_size
                                                                                                     List of functions
   Schema   |       Name       | Result data type | Argument data types |  Type  |  Data access   | Volatility |  Owner   | Language |      Source code      |                         Description
------------+------------------+------------------+---------------------+--------+----------------+------------+----------+----------+-----------------------+-------------------------------------------------------------
 pg_catalog | pg_database_size | bigint           | name                | normal | reads sql data | volatile   | dege.zzz | internal | pg_database_size_name | Calculate total disk space usage for the specified database
 pg_catalog | pg_database_size | bigint           | oid                 | normal | reads sql data | volatile   | dege.zzz | internal | pg_database_size_oid  | Calculate total disk space usage for the specified database
(2 rows)

The source of pg_database_size_name is shown below.
Clearly, computing the database size also relies on select sum(pg_database_size('%s'))::int8 from gp_dist_random('gp_id');

Datum
pg_database_size_name(PG_FUNCTION_ARGS)
{
        int64           size = 0;
        Name            dbName = PG_GETARG_NAME(0);
        Oid                     dbOid = get_database_oid(NameStr(*dbName));  

        if (!OidIsValid(dbOid))
                ereport(ERROR,
                                (errcode(ERRCODE_UNDEFINED_DATABASE),
                                 errmsg("database \"%s\" does not exist",
                                                NameStr(*dbName))));  

        size = calculate_database_size(dbOid);  

        if (Gp_role == GP_ROLE_DISPATCH)
        {
                StringInfoData buffer;  

                initStringInfo(&buffer);  

                appendStringInfo(&buffer, "select sum(pg_database_size('%s'))::int8 from gp_dist_random('gp_id');", NameStr(*dbName));  

                size += get_size_from_segDBs(buffer.data);
        }  

        PG_RETURN_INT64(size);
}

To verify, we can run that SQL directly. The result is almost identical to what pg_database_size returns; the two differ only by the master's local calculate_database_size portion (here 16006763924106 − 16006753522624 = 10401482 bytes, roughly 10 MB).

postgres=# select sum(pg_database_size('postgres'))::int8 from gp_dist_random('gp_id');
      sum
----------------
 16006753522624
(1 row)  

postgres=# select pg_database_size('postgres');
 pg_database_size
------------------
   16006763924106
(1 row)

gp_dist_random('gp_id') is essentially a query of gp_id on every node,
and gp_dist_random('pg_authid') likewise queries pg_authid on every node.
For example:

postgres=# select * from gp_dist_random('gp_id');
  gpname   | numsegments | dbid | content
-----------+-------------+------+---------
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
 Greenplum |          -1 |   -1 |      -1
......
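Since every segment returns an identical row, it can be useful to see which segment produced each one. Greenplum provides gp_execution_segment() for this; a sketch assuming that function is available in your version:

```sql
-- gp_execution_segment() returns the content id of the segment
-- executing the expression, so each segment contributes its own id.
select gp_execution_segment() as segid
from gp_dist_random('gp_id')
order by segid;
```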

If you do not want that many rows back, you can add a limit, but the query still executes on every segment, as shown below:

postgres=# explain analyze select random() from gp_dist_random('gp_id') limit 1;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.04 rows=1 width=0)
   Rows out:  1 rows with 5.865 ms to first row, 5.884 ms to end, start offset by 4212 ms.
   ->  Gather Motion 240:1  (slice1; segments: 240)  (cost=0.00..0.04 rows=1 width=0)
         Rows out:  1 rows at destination with 5.857 ms to end, start offset by 4212 ms.
         ->  Limit  (cost=0.00..0.02 rows=1 width=0)
               Rows out:  Avg 1.0 rows x 240 workers.  Max 1 rows (seg0) with 0.062 ms to first row, 0.063 ms to end, start offset by 4228 ms.
               ->  Seq Scan on gp_id  (cost=0.00..4.00 rows=1 width=0)
                     Rows out:  Avg 1.0 rows x 240 workers.  Max 1 rows (seg0) with 0.060 ms to end, start offset by 4228 ms.
 Slice statistics:
   (slice0)    Executor memory: 463K bytes.
   (slice1)    Executor memory: 163K bytes avg x 240 workers, 163K bytes max (seg0).
 Statement statistics:
   Memory used: 128000K bytes
 Total runtime: 4288.007 ms
(14 rows)
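If the goal is to run the function on just one segment rather than merely trimming the output, filtering on the gp_segment_id system column is one possible approach (a hedged sketch; availability of this column on a gp_dist_random scan may vary by Greenplum version):

```sql
-- The scan is still dispatched to all segments, but only the
-- segment whose content id is 0 emits a row.
select random()
from gp_dist_random('gp_id')
where gp_segment_id = 0;
```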
Date: 2024-11-05 19:28:07
