PostgreSQL 数据采样与脱敏

背景

测试数据

postgres=# create table test(id int primary key, username text, phonenum text, addr text, pwd text, crt_time timestamp);
CREATE TABLE  

postgres=# insert into test select id, 'test_'||id, 13900000000+(random()*90000000)::int, '中国杭州xxxxxxxxxxxxxxxxxx'||random(), md5(random()::text), clock_timestamp() from generate_series(1,10000000) t(id);
INSERT 0 10000000  

postgres=# select * from test limit 10;
 id | username |  phonenum   |                    addr                     |               pwd                |          crt_time
----+----------+-------------+---------------------------------------------+----------------------------------+----------------------------
  1 | test_1   | 13950521974 | 中国杭州xxxxxxxxxxxxxxxxxx0.953363882377744 | 885723a5f4938808235c5debaab473ec | 2017-06-02 15:05:55.465132
  2 | test_2   | 13975998000 | 中国杭州xxxxxxxxxxxxxxxxxx0.91321265604347  | 7ea01dc02c0fbc965f38d1bf12b303eb | 2017-06-02 15:05:55.46534
  3 | test_3   | 13922255548 | 中国杭州xxxxxxxxxxxxxxxxxx0.846756176557392 | 7c2992bdc69312cbb3bb135dd2b98491 | 2017-06-02 15:05:55.46535
  4 | test_4   | 13985121895 | 中国杭州xxxxxxxxxxxxxxxxxx0.639280265197158 | 202e32f0f0e3fe669c00678f7acd2485 | 2017-06-02 15:05:55.465355
  5 | test_5   | 13982757650 | 中国杭州xxxxxxxxxxxxxxxxxx0.501174578908831 | b6a42fc1ebe9326ad81a81a5896a5c6c | 2017-06-02 15:05:55.465359
  6 | test_6   | 13903699864 | 中国杭州xxxxxxxxxxxxxxxxxx0.193029860965908 | f6bc06e5cda459d09141a2c93f317cf2 | 2017-06-02 15:05:55.465363
  7 | test_7   | 13929797532 | 中国杭州xxxxxxxxxxxxxxxxxx0.192601112183183 | 75c12a3f14c7ef3e558cef79d84a7e8e | 2017-06-02 15:05:55.465368
  8 | test_8   | 13961108182 | 中国杭州xxxxxxxxxxxxxxxxxx0.900682372972369 | 5df33d15cf7726f2fb57df3ed913b306 | 2017-06-02 15:05:55.465371
  9 | test_9   | 13978455210 | 中国杭州xxxxxxxxxxxxxxxxxx0.87795089604333  | cbe233f00cdd3c61c67415c1f8691846 | 2017-06-02 15:05:55.465375
 10 | test_10  | 13957044022 | 中国杭州xxxxxxxxxxxxxxxxxx0.410478914622217 | cdf2f98b0ff5a973efaca6a82625e283 | 2017-06-02 15:05:55.465379
(10 rows)

采样

9.5以前的版本，高效采样请参考

《PostgreSQL 巧妙的数据采样方法》

9.5以及以后的版本，可以使用tablesample语法进行采样（注意，采样过滤器在where条件过滤器的前面）。

语法如下

https://www.postgresql.org/docs/9.6/static/sql-select.html

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]  

sampling_method指采样方法  

argument指参数，例如采样比例。  

REPEATABLE(seed) 指采样随机种子，如果种子一样，那么多次采样请求得到的结果是一样的。如果忽略REPEATABLE则每次都是使用新的seed值，得到不同的结果。

例子1，BERNOULLI(百分比)采样，使用全表扫描的采样方法，按采样参数百分比返回。

postgres=# select * from test TABLESAMPLE bernoulli (1);
   id    |   username   |  phonenum   |                      addr                      |               pwd                |          crt_time
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------
     110 | test_110     | 13967004360 | 中国杭州xxxxxxxxxxxxxxxxxx0.417577873915434    | 437e5c29e12cbafa0563332909436d68 | 2017-06-02 15:05:55.46585
     128 | test_128     | 13901119801 | 中国杭州xxxxxxxxxxxxxxxxxx0.63212554808706     | 973dba4b35057d44997eb4744eea691b | 2017-06-02 15:05:55.465938
     251 | test_251     | 13916668924 | 中国杭州xxxxxxxxxxxxxxxxxx0.0558807463385165   | 71217eedce421bd0f475c0e4e6eb32a9 | 2017-06-02 15:05:55.466423
     252 | test_252     | 13981440056 | 中国杭州xxxxxxxxxxxxxxxxxx0.457073447294533    | 6649c37c0f0287637a4cb80d84b6bde0 | 2017-06-02 15:05:55.466426
     423 | test_423     | 13982447202 | 中国杭州xxxxxxxxxxxxxxxxxx0.816960731055588    | 11a8d6d1374cf7565877def6a147f544 | 2017-06-02 15:05:55.46717
......

例子2，SYSTEM(百分比)采样，使用块级采样方法，按采样参数百分比返回（被采样到的数据块，内的所有记录都将被返回）。因此离散度不如BERNOULLI，但是效率高很多。

postgres=# select * from test TABLESAMPLE system (1);
   id    |   username   |  phonenum   |                      addr                      |               pwd                |          crt_time
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------
    6986 | test_6986    | 13921391589 | 中国杭州xxxxxxxxxxxxxxxxxx0.874497607816011    | e6a5d695aca17de0f6489d740750c758 | 2017-06-02 15:05:55.495697
    6987 | test_6987    | 13954425190 | 中国杭州xxxxxxxxxxxxxxxxxx0.374216149561107    | 813fffbf1ee7157c459839987aa7f4b0 | 2017-06-02 15:05:55.495721
    6988 | test_6988    | 13901878095 | 中国杭州xxxxxxxxxxxxxxxxxx0.624850326217711    | 5056caaad5e076f82b8caec9d02169f6 | 2017-06-02 15:05:55.495725
    6989 | test_6989    | 13940504557 | 中国杭州xxxxxxxxxxxxxxxxxx0.705925882328302    | a5b4062086a3261740c82774616e64ee | 2017-06-02 15:05:55.495729
    6990 | test_6990    | 13987358496 | 中国杭州xxxxxxxxxxxxxxxxxx0.981084300205112    | 6ba0b6c9d484e6fb90181dc86cb6598f | 2017-06-02 15:05:55.495734
    6991 | test_6991    | 13948658183 | 中国杭州xxxxxxxxxxxxxxxxxx0.6592857837677      | 9a0eadd056eeb6e3c1e2b984777cdf6b | 2017-06-02 15:05:55.495738
    6992 | test_6992    | 13934074866 | 中国杭州xxxxxxxxxxxxxxxxxx0.232706854119897    | 84f6649beac3b78a3a1afeb9c3aabccd | 2017-06-02 15:05:55.495741
......

用户还可以通过以下接口自定义采样方法

https://www.postgresql.org/docs/9.6/static/tablesample-method.html

脱敏

脱敏的手段很多，用户对脱敏的需求也可能很多。

常见的例如

1. 隐藏字符串中间的内容，使用*表示，同时保持原始长度

2. 隐藏字符串中间的内容，使用*表示，不保持原始长度

3. 返回加密值

不管什么需求，实际上就是数据的转换，从原始值，转换为目标值。在PostgreSQL中可以通过function实现这样的转换，对不同的需求，编写不同的转换逻辑即可。

例子，将字符串中间部分模糊化，只显示字符串头2个，末尾1个。

select id, substring(username,1,2)||'******'||substring(username,length(username),1),
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),
substring(addr,1,2)||'******'||substring(addr, length(addr),1),
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),
crt_time
from test
TABLESAMPLE bernoulli (1);  

   id    | ?column?  | ?column?  |  ?column?   | ?column?  |          crt_time
---------+-----------+-----------+-------------+-----------+----------------------------
      69 | te******9 | 13******5 | 中国******9 | c0******2 | 2017-06-02 15:32:26.261624
     297 | te******7 | 13******2 | 中国******1 | d9******6 | 2017-06-02 15:32:26.262558
     330 | te******0 | 13******5 | 中国******3 | bd******0 | 2017-06-02 15:32:26.262677
     335 | te******5 | 13******5 | 中国******6 | 08******f | 2017-06-02 15:32:26.262721
     416 | te******6 | 13******6 | 中国******2 | b3******d | 2017-06-02 15:32:26.26312
     460 | te******0 | 13******4 | 中国******8 | e5******f | 2017-06-02 15:32:26.26332
     479 | te******9 | 13******1 | 中国******1 | 1d******4 | 2017-06-02 15:32:26.263393
     485 | te******5 | 13******0 | 中国******3 | a3******8 | 2017-06-02 15:32:26.263418
     692 | te******2 | 13******9 | 中国******4 | 69******8 | 2017-06-02 15:32:26.264326
    1087 | te******7 | 13******9 | 中国******3 | 8e******5 | 2017-06-02 15:32:26.266091
    1088 | te******8 | 13******8 | 中国******7 | 37******e | 2017-06-02 15:32:26.266095
    1116 | te******6 | 13******8 | 中国******2 | 4c******3 | 2017-06-02 15:32:26.266235
    1210 | te******0 | 13******4 | 中国******8 | 49******c | 2017-06-02 15:32:26.266671
......

如果需要更复杂的转换，写PostgreSQL的UDF对字段值进行转换即可。

将采样结果抽取到其他平台的方法也很多，例如copy到stdout，或者ETL工具等。

例子

psql test -c "copy (select id, substring(username,1,2)||'******'||substring(username,length(username),1),
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),
substring(addr,1,2)||'******'||substring(addr, length(addr),1),
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),
crt_time
from test
TABLESAMPLE bernoulli (1)
) to stdout" > ./sample_test.log  

less sample_test.log
54      te******4       13******4       中国******3     52******b       2017-06-02 15:32:26.261451
58      te******8       13******6       中国******3     23******a       2017-06-02 15:32:26.261584
305     te******5       13******6       中国******9     c0******4       2017-06-02 15:32:26.262587
399     te******9       13******5       中国******4     71******7       2017-06-02 15:32:26.26298
421     te******1       13******0       中国******4     21******3       2017-06-02 15:32:26.263139
677     te******7       13******5       中国******5     e2******7       2017-06-02 15:32:26.264269
874     te******4       13******9       中国******2     a6******9       2017-06-02 15:32:26.265159

参考

《PostgreSQL 巧妙的数据采样方法》

https://www.postgresql.org/docs/9.6/static/tablesample-method.html

https://www.postgresql.org/docs/9.6/static/sql-select.html

时间： 2024-08-29 10:52:57

PostgreSQL 数据采样与脱敏的相关文章

PostgreSQL 数据文件灾难恢复 - 解析与数据dump

标签 PostgreSQL , 数据文件 , pg_filedump , 安全 , TDE 背景俗话说常在河边站哪有不湿鞋,作为一名战斗在一线的DBA或者开发者,可能有遇到过磁盘损坏,磁盘阵列损坏,如果有备份或者备库的话,还好. 如果没有备份,或者没有备库(通常有一些小型或者创业型的企业),那么遇到磁盘损坏或者其他原因(比如掉电文件系统损坏),导致数据库的数据文件并不完整时,如何从有限的资料中找出数据呢? 比如PostgreSQL,如果读到坏块,会报块不可读的错误,这种情况下通过设置zero_

PostgreSQL数据保留窗口功能的使用

标签 PostgreSQL , 保留窗口 , ttl , stream , continuous view , pipelinedb 背景类似mongodb的rotate collate(设置表的上限容量.上限记录数,持续写入,自动覆盖最老的记录),PostgreSQL通过pipelinedb也能实现类似的功能. 此类功能非常适合日志数据,无需维护成本,持续写入,自动覆盖最老的记录. pipelinedb计划会在2017-07月份转换为postgresql插件,届时使用会更加便利. 如果不使用

PostgreSQL 数据rotate用法介绍

标签 PostgreSQL , 按时间覆盖历史数据背景在某些业务场景中,数据有冷热之分,例如业务只关心最近一天.一周或者一个月的数据.对于历史的数据可以丢弃. 比如某些指标的监控场景,保留一周的监控数据,历史的都可以丢弃. 如何丢弃历史数据?或者说如何实现rotate? 1. 使用delete, 删除7天前的数据. delete from table where crt_time<=now()-interval '7 day'; 这种方法会带来额外的开销,包括写REDO日志,垃圾回收等.如果

TDW与PostgreSQL数据互访问功能

一.开源项目TDW介绍腾讯分布式http://www.aliyun.com/zixun/aggregation/8302.html">数据仓库 ( Tencent distributed Data Warehouse,以下简称TDW) 是腾讯工程技术事业群数据平台部基于开源软件研发的大数据处理平台,它基于Hadoop.Hive.PostgreSQL之上进行研发,并在开源软件的基础上做了大量的定制和优化.目前TDW是腾讯内部规模最大的分布式系统,集中了腾讯内部各个产品的数据,为腾讯的各个产

PostgreSQL 数据访问 offset 的质变 case

背景 offset limit是一个多么常见的需求啊,但是你知道offset的数据可能隐藏着质变吗? 如图 node有30W条数据,其中前100条是满足条件的,然后100条到20W条都是不满足条件的.所以offset 10 limit 10非常的快.但是offset 100 limit 10,就要扫描从100到20W条记录,然后再往后才是满足条件的记录.这就是质变的原因. 例子生成1000万测试记录. postgres=# create table tbl(id int primary key

PostgreSQL 数据去重大法

标签 PostgreSQL , 去重 , 单列去重 , 多列去重 , 行去重 , 多列混合去重 , varidict 参数 , 数组排序 , 数组元素重排背景去重的需求比较常见,去重也可以衍生出很多变种.例如 1. 单列去重,很好理解,就是按某列去除重复记录.保留规则(例如保留最新的,保留最旧的,或者保留某个其他字段最大的). 2. 多列去重,按多列,去除重复记录.保留规则(例如保留最新的,保留最旧的,或者保留某个其他字段最大的). 3. 行去重,按行,去除重复记录.保留规则(例如保留最新的

TB级大表秒级任意维度分析 - 采样估值满足高效TOP N等分析需求

标签 PostgreSQL , 采样 , sample , TOP N , 统计分析背景估值计算是统计学的常用手段.因为数据量庞大,求精确数值需要耗费巨大的资源,而统计分析并不要求完全精确的数据,因此估值计算是一种折中的方法,广泛应用于统计分析场景. PostgreSQL是一个功能强大的数据库,在估值统计方面,提供了很多方法. 1.PostgreSQL中,求估计的UV,增量UV等(即count distinct),可以通过HLL插件来实现. <Greenplum 最佳实践 - 估值插件hll

数据库案例集锦 - 开发者的《如来神掌》

背景「剑魔独孤求败,纵横江湖三十馀载,杀尽仇寇,败尽英雄,天下更无抗手,无可柰何,惟隐居深谷,以雕为友.呜呼,生平求一敌手而不可得,诚寂寥难堪也.」剑冢中,埋的是剑魔独孤求败毕生几个阶段中用过的几柄剑: 利剑无意:第一柄是青光闪闪的利剑,凌厉刚猛,无坚不摧,弱冠前以之与河朔群雄争锋. 软剑无常:第二柄是紫薇软剑,三十岁前所用,误伤义士不祥,悔恨不已,乃弃之深谷. 重剑无锋:第三柄是玄铁重剑,重剑无锋,大巧不工,四十岁之前恃之横行天下. 木剑无俦:第四柄是已腐朽的木剑. 无剑无招:四十岁后,不

PostgreSQL 如何快速构建海量逼真测试数据

标签 PostgreSQL , pgbench , 压测 , 变量 , 测试数据构建背景为了测试或验证需要,通常需要快速的构建测试数据. PostgreSQL提供了一些非常有用的功能,可以帮助用户快速的构建测试数据. 有趣的功能 1.SRF 返回多条记录的函数.例如 List of functions Schema | Name | Result data type | Argument data types | Type ------------+---------------------