zabbix template pg_monz for PostgreSQL / iVAN

[root@db-172-16-3-221 pg_monz]# grep -r PGSCRIPTDIR *
pg_monz_template.xml:                    <key>db_table.list.discovery[{$PGHOST},{$PGPORT},{$PGROLE},{$PGDATABASE},{$PGSCRIPTDIR}]</key>
pg_monz_template.xml:                    <key>db.list.discovery[{$PGHOST},{$PGPORT},{$PGROLE},{$PGDATABASE},{$PGSCRIPTDIR}]</key>
pg_monz_template.xml:                    <macro>{$PGSCRIPTDIR}</macro>

显然是用于调用find_dbname.sh和find_dbname_table.sh的.

PGLOGDIR则是postgresql log的目录.

[root@db-172-16-3-221 pg_monz]# grep -r PGLOGDIR *
pg_monz_template.xml:                    <key>logrt["{$PGLOGDIR}/postgresql-.*\.log","PANIC|FATAL|ERROR|[Ee]rror"]</key>
pg_monz_template.xml:                    <macro>{$PGLOGDIR}</macro>

但是LOG目录可能不同的数据库配置不一样, 所以你可以在所有的数据库服务器建立一个软链接, 来解决这个问题.

例如 :

ln -s $PGDATA/pg_log /var/pg_log
{$PGLOGDIR} = /var/pg_log  (统一使用这个)

PGROLE可能需要PG超级用户, 或者是可以执行本监控模板中所有SQL的用户.

如果需要密码的话, 建议配置.pgpass.

7. 下载监控脚本以及配置文件(zabbix agent 端)

https://github.com/pg-monz/pg_monz/blob/master/pg_monz/find_dbname.sh

https://github.com/pg-monz/pg_monz/blob/master/pg_monz/find_dbname_table.sh

https://github.com/pg-monz/pg_monz/blob/master/pg_monz/userparameter_pgsql.conf

脚本需要放到模板宏PGSCRIPTDIR的制定位置.

并配置可执行权限.

[root@db-172-16-3-221 pg_monz]# cp find_dbname.sh find_dbname_table.sh /usr/local/bin/
[root@db-172-16-3-221 pg_monz]# chown postgres:postgres /usr/local/bin/find_dbname*.sh
[root@db-172-16-3-221 pg_monz]# chmod 555 /usr/local/bin/find_dbname*.sh

简单的来看一个脚本:

find_dbname.sh, 用来输出PostgreSQL的所有数据库 :

[root@db-172-16-3-221 pg_monz]# cat find_dbname.sh
#!/bin/bash

# Get list of Database Name which you want to monitor.
# The default settings are excepted template databases(template0/template1).
#
# :Example
#
# If you want to monitor "foo" and "bar" databases, you set the GETDB as
# GETDB="select datname from pg_database where datname in ('foo','bar');"

GETDB="select datname from pg_database where datistemplate = 'f';"

for dbname in $(psql -h $1 -p $2 -U $3 -d $4 -t -c "${GETDB}"); do
    dblist="$dblist,"'{"{#DBNAME}":"'$dbname'"}'
done
echo '{"data":['${dblist#,}' ]}'

[root@db-172-16-3-221 pg_monz]# export PATH=/opt/pgsql/bin:$PATH
[root@db-172-16-3-221 pg_monz]# . ./find_dbname.sh 127.0.0.1 5432 postgres postgres
{"data":[{"{#DBNAME}":"postgres"},{"{#DBNAME}":"digoal"} ]}

find_dbname_table.sh, 用来输出数据库对应的表.

在得到数据库和表之后, 可以对数据库, 表进行监控查询.

8. 配置zabbix agent端.

8.1. 修改数据库的pg_log, 到/var/pg_log, (和PGLOGDIR一致即可).

另外需要确认其他的宏和被监控的数据库一致.

[root@db-172-16-3-221 pg_monz]# ln -s /data01/pgdata/pg_root/pg_log /var/

8.2. 可能需要添加.pgpass, 配合模板需要的连接, 不需要密码访问.

8.3. 将userparameter_pgsql.conf拷贝到zabbix agent配置目录.

[root@db-172-16-3-221 pg_monz]# cp userparameter_pgsql.conf /opt/zabbix/etc/zabbix_agentd.conf.d/

这里的监控项如下 :

[root@db-172-16-3-221 zabbix_agentd.conf.d]# cat userparameter_pgsql.conf
# PostgreSQL user parameter

#
# Server specific examples
#
# Get the total number of committed transactions
UserParameter=psql.tx_committed[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select sum(xact_commit) from pg_stat_database"
# Get the total number of rolled back transactions
UserParameter=psql.tx_rolledback[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select sum(xact_rollback) from pg_stat_database"
# Max Connections
UserParameter=psql.server_maxcon[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "show max_connections"
# PostgreSQL is running
UserParameter=psql.running[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select 1" > /dev/null 2>&1 ; echo $?

# Added by SRA OSS
# Get number of checkpoint count (by checkpoint_timeout)
UserParameter=psql.checkpoints_timed[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select checkpoints_timed from pg_stat_bgwriter"
# Get number of checkpoint count (by checkpoint_segments)
UserParameter=psql.checkpoints_req[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select checkpoints_req from pg_stat_bgwriter"
# Get the total number of connections
UserParameter=psql.server_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity;"
# Get the total number of active (on processing SQL) connections
UserParameter=psql.active_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(state) from pg_stat_activity where state = 'active'"
# Get the total number of idle connections
UserParameter=psql.idle_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(state) from pg_stat_activity where state = 'idle'"
# Get the total number of idle in transaction connections
UserParameter=psql.idle_tx_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(state) from pg_stat_activity where state = 'idle in transaction'"
# Get the total number of lock-waiting connections
UserParameter=psql.locks_waiting[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity where waiting = 't'"

# Get buffer information
UserParameter=psql.buffers_checkpoint[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_checkpoint from pg_stat_bgwriter"
UserParameter=psql.buffers_clean[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_clean from pg_stat_bgwriter"
UserParameter=psql.maxwritten_clean[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select maxwritten_clean from pg_stat_bgwriter"
UserParameter=psql.buffers_backend[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_backend from pg_stat_bgwriter"
UserParameter=psql.buffers_backend_fsync[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_backend_fsync from pg_stat_bgwriter"
UserParameter=psql.buffers_alloc[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_alloc from pg_stat_bgwriter"

# Get number of slow queries
UserParameter=psql.slow_queries[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity where state = 'active' and now() - query_start > '$5 sec'::interval"
UserParameter=psql.slow_select_queries[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity where state = 'active' and now() - query_start > '$5 sec'::interval and query ilike 'select%'"
UserParameter=psql.slow_dml_queries[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity where state = 'active' and now() - query_start > '$5 sec'::interval and query ~* '^(insert|update|delete)'"

#
# Database specific examples
#
# Get the size of a Database (in bytes)
UserParameter=psql.db_size[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select pg_database_size('$5') from pg_database where datname = '$5'"
# Get number of active connections for a specified database
UserParameter=psql.db_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select numbackends from pg_stat_database where datname = '$5'"
# Get number of tuples returned for a specified database
UserParameter=psql.db_returned[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_returned from pg_stat_database where datname = '$5'"
# Get number of tuples fetched for a specified database
UserParameter=psql.db_fetched[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_fetched from pg_stat_database where datname = '$5'"
# Get number of tuples inserted for a specified database
UserParameter=psql.db_inserted[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_inserted from pg_stat_database where datname = '$5'"
# Get number of tuples updated for a specified database
UserParameter=psql.db_updated[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_updated from pg_stat_database where datname = '$5'"
# Get number of tuples deleted for a specified database
UserParameter=psql.db_deleted[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_deleted from pg_stat_database where datname = '$5'"
# Get number of committed/rolled back transactions for a specified database
UserParameter=psql.db_tx_committed[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select xact_commit from pg_stat_database where datname = '$5'"
UserParameter=psql.db_tx_rolledback[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select xact_rollback from pg_stat_database where datname = '$5'"

# Cache Hit Ratio
UserParameter=psql.cachehit_ratio[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "SELECT round(blks_hit*100/(blks_hit+blks_read), 2) AS cache_hit_ratio FROM pg_stat_database WHERE datname = '$5' and blks_read > 0 union all select 0.00 AS cache_hit_ratio order by cache_hit_ratio desc limit 1"

# Added by SRA OSS
# Get number of temp files
UserParameter=psql.db_temp_files[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select temp_files from pg_stat_database where datname = '$5'"
# Get temp file size (in bytes)
UserParameter=psql.db_temp_bytes[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select temp_bytes from pg_stat_database where datname = '$5'"
# Get percentage of dead tuples of all tables for a specified database
UserParameter=psql.db_dead_tup_ratio[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select round(sum(n_dead_tup)*100/sum(n_live_tup+n_dead_tup), 2) as dead_tup_ratio from pg_stat_all_tables where n_live_tup > 0"
# Get number of deadlocks for a specified database (9.2 or later)
UserParameter=psql.db_deadlocks[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select deadlocks from pg_stat_database where datname = '$5'"

#
# Table specific examples
#
# Get table cache hit ratio of a specific table
UserParameter=psql.table_cachehit_ratio[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select round(heap_blks_hit*100/(heap_blks_hit+heap_blks_read), 2) as cache_hit_ratio from pg_statio_user_tables where schemaname = '$5' and relname = '$6' and heap_blks_read > 0 union all select 0.00 as cache_hit_ratio order by cache_hit_ratio desc limit 1"
# Get number of sequencial scan of a specific table
UserParameter=psql.table_seq_scan[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select seq_scan from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of index scan of a specific table
UserParameter=psql.table_idx_scan[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select coalesce(idx_scan,0) from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of vacuum count of a specific table
UserParameter=psql.table_vacuum_count[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select vacuum_count from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of analyze count of a specific table
UserParameter=psql.table_analyze_count[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select analyze_count from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of autovacuum count of a specific table
UserParameter=psql.table_autovacuum_count[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select autovacuum_count from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of autoanalyze count of a specific table
UserParameter=psql.table_autoanalyze_count[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select autoanalyze_count from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"

# Get number of tuples of a specific table
UserParameter=psql.table_n_tup_ins[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_tup_ins from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_tup_upd[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_tup_upd from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_tup_del[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_tup_del from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_seq_tup_read[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select seq_tup_read from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_idx_tup_fetch[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select coalesce(idx_tup_fetch,0) from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_tup_hot_upd[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_tup_hot_upd from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_live_tup[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_live_tup from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_dead_tup[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_dead_tup from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"

#
# Discovery Rule
#
# Database Discovery
UserParameter=db.list.discovery[*],$5/find_dbname.sh $1 $2 $3 $4
UserParameter=db_table.list.discovery[*],$5/find_dbname_table.sh $1 $2 $3 $4

8.4. 配置zabbix_agentd.conf (被动监控)

[root@db-172-16-3-221 etc]# cd /opt/zabbix/etc/
[root@db-172-16-3-221 etc]# ll
total 20
-rw-r--r-- 1 root root 2485 Sep  1 14:45 zabbix_agent.conf
drwxr-xr-x 2 root root 4096 Sep 22 14:24 zabbix_agent.conf.d
-rw-r--r-- 1 root root 7456 Sep  1 17:19 zabbix_agentd.conf
drwxr-xr-x 2 root root 4096 Sep  1 14:45 zabbix_agentd.conf.d

[root@db-172-16-3-221 etc]# vi zabbix_agentd.conf
Include=/opt/zabbix/etc/zabbix_agentd.conf.d/
PidFile=/tmp/zabbix_agentd.pid
LogFile=/tmp/zabbix_agentd.log
LogFileSize=10
EnableRemoteCommands=1
LogRemoteCommands=1
Server=172.16.3.150
ListenPort=10050
ListenIP=0.0.0.0
StartAgents=8
ServerActive=172.16.3.150:10052
Hostname=agent221
RefreshActiveChecks=60

8.5. 启动agentd

[root@db-172-16-3-221 etc]# su - postgres -c "/opt/zabbix/sbin/zabbix_agentd -c /opt/zabbix/etc/zabbix_agentd.conf"

9. 创建监控主机, 或配置已监控的主机添加模板.