海量数据迁移之外部表并行抽取

在10g开始的新特性中，外部表是一个不容忽视的好工具。对于大型项目中海量数据使用sqlloader是一种全新的方式，不过很明显,sqlloader的可扩展性更强，但是基于oracle平台的数据迁移来说，外部表的性能也不错。对于数据迁移来说也是一个很好的方案。
使用外部表来做数据迁移，可以“动态”加载数据，能够很方便的从数据库中加载数据，对于数据校验来说就显得很有优势了，而对于sqlloader来说，可能得等到数据加载的时候才知道是不是有问题，如果对于数据的准确性要求极高，可以使用外部表动态加载数据到备库，和现有的数据做比对，减少在升级过程中带来的灾难。
还有关于数据类型，对于clob,blob的加载，大家都比较头疼，在sqlloader中可能需要做一些额外的工作，来外部表中就和操作普通的表没有什么区别。

先来说说数据抽取的部分。
一下是我今天完成的部分脚本，目录结构如下。
drwxr-xr-x 2 ora11g dba 4096 Jun 9 22:14 DUMP_LIST
drwxr-xr-x 2 ora11g dba 4096 Jun 9 23:25 extract
drwxr-xr-x 2 ora11g dba 4096 Jun 9 22:32 parfile
[ora11g@rac1 ext_datapump]$ pwd
/u01/ora11g/test/ext_datapump

对于一些比较大的表，如果占用的空间在好几十个G左右的时候，生成一个巨大的dump文件就有问题了，一来是关于io，顺序的写入dump文件，而且在加载的时候也没有其他的选择了，只能从头到尾一步一步来。
我的设想就是如果一个表有100G，可以把他切分为200个dump文件，每个500M,或者说生成1000个dump文件，每个dump文件100M，这样在加载的时候就可以很清楚的看到目前数据加载的进度了。
我使用了如下的脚本来生成多个dump文件，
tablst=`cat ../parfile/tablst|awk '{print "'\''" $1 "'\'',"}'`
#for segment within 50M, will use parallel 1, and parallel will calculated with segment_bytes_size_MB/50M
page=50
sqlplus -s n1/n1
set feedback off
set head off
set line 100
set pages 0
set long 10000
set termout off
col segment_name for a40
col parallel format 9999
spool tab_parall_temp.lst
select distinct segment_name,ceil(sum(bytes/1024/1024)/50) parallel from user_segments
where segment_name in (select table_name from user_tables where table_name in ($tablst'x') and table_name not in (select table_name from user_external_tables))
group by segment_name order by 2 desc;
spool off;
EOF
sed '/^$/d' tab_parall_temp.lst |sort > ../parfile/tab_parall.lst
rm tab_parall_temp.lst

运行脚本后，结果如下所示。
BIG_INSERT 1
CLOB_TEST 1
SMALL_INSERT 1
T 1
TEMP_TEST 1
TEST 1
TEST_DATA 1
TT 2
T_TEMP 1
可以看到表tt比较大，就需要分成了两个dump。加载的时候也可以分批加载。
当然，生成的dump的个数也和一个数据库参数密切相关，像我目前的库，最大的并行只有30. 意味着如果我要把一个表切分成50个dump，根据配置，只能最多切分为30个。
parallel_max_servers integer 30

在生面的基础上，可以采用下面的脚本来生成dump文件。
#### source owner $1
#### tab_name $2
#### target owner $3
#### dump directory $4

owner=`echo "$1"|tr '[a-z]' '[A-Z]'`
tab_name=`echo "$2"|tr '[a-z]' '[A-Z]'`
tablst=`cat ../parfile/tablst|awk '{print "'\''" $1 "'\'',"}'`

tmp_parallel=`grep -w $tab_name ../parfile/tab_parall.lst|awk '{print $2}'`
for i in {1..$tmp_parallel};
do
echo \'${tab_name}_$i.dmp\', >> tmp_${tab_name}_par_dmp.lst
done

sed -e '/^$/d' -e '$s/.$//' tmp_${tab_name}_par_dmp.lst > ../DUMP_LIST/${tab_name}_par_dmp.lst
dump_list=`cat ../DUMP_LIST/${tab_name}_par_dmp.lst`
#echo $dump_list
#echo $tmp_parallel
sqlplus -s n1/n1
set serveroutput on
DECLARE
TABLE_FLAG number(2);
begin
execute immediate 'select count(*) from all_tables where owner=upper(''$1'') and table_name=upper(''$2'')' into table_flag;
if(table_flag>0) then
dbms_output.put_line( 'TABLE '||'$tab_name'||' exists in owner account,proceed...');
else
dbms_output.put_line( 'TABLE does not exists in owner account,please check again');
return;
end if;
end;
/

set timing on
DECLARE
TABLE_FLAG number(2);
begin
execute immediate 'select count(*) from all_external_tables where owner=upper(''$3'') and table_name=upper(''$2'')' into table_flag;
if(table_flag>0) then
dbms_output.put_line( 'External table exists in owner account,proceed...');
execute immediate 'drop table $3.$2_ext';
end if;
end;
/
exec dbms_output.put_line('Get Dump file for $1.$2...');
create table $3.$2_ext
ORGANIZATION EXTERNAL
(TYPE ORACLE_DATAPUMP
DEFAULT DIRECTORY $4
LOCATION (
$dump_list
)
)
parallel $tmp_parallel
as
select /*+ parallel(t $tmp_parallel) */ * from $1.$2 t;
set feedback off;
set timing off
drop table $3.$2_ext;
EOF
rm tmp_${tab_name}_par_dmp.lst
exit

脚本运行效果如下：
Get Dump file for n1.TT...
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.01
Table created.
Elapsed: 00:00:03.07
TABLE T_TEMP exists in owner account,proceed...
PL/SQL procedure successfully completed.
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.01
Get Dump file for n1.T_TEMP...
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.00
Table created.
Elapsed: 00:00:01.39

生成的dump文件如下：
[ora11g@rac1 expdp]$ ll *.dmp
-rw-r----- 1 ora11g dba 466944 Jun 10 01:43 BIG_INSERT_1.dmp
-rw-r----- 1 ora11g dba 12288 Jun 10 01:43 CLOB_TEST_1.dmp
-rw-r----- 1 ora11g dba 40960 Jun 10 01:43 SMALL_INSERT_1.dmp
-rw-r----- 1 ora11g dba 30531584 Jun 10 01:43 T_1.dmp
-rw-r----- 1 ora11g dba 524288 Jun 10 01:43 TEMP_TEST_1.dmp
-rw-r----- 1 ora11g dba 466944 Jun 10 01:43 TEST_1.dmp
-rw-r----- 1 ora11g dba 69632 Jun 10 01:43 TEST_DATA_1.dmp
-rw-r----- 1 ora11g dba 39018496 Jun 10 01:43 TT_1.dmp
-rw-r----- 1 ora11g dba 43634688 Jun 10 01:43 TT_2.dmp
-rw-r----- 1 ora11g dba 30531584 Jun 10 01:43 T_TEMP_1.dmp

在比较同样的数据量的情况下，sqlloader可能需要大约120G的容量，而对于外部表来说，大概只需要40G左右的空间。

时间： 2024-10-23 20:16:28

海量数据迁移之外部表并行抽取

海量数据迁移之外部表并行抽取的相关文章

海量数据迁移之外部表切分

海量数据迁移之外部表加载

阿里云RDS PostgreSQL OSS 外部表 - 并行写提速案例

Greenplum insert的性能(单步\批量\copy) - 暨推荐使用gpfdist、阿里云oss外部表并行导入

海量数据迁移之传输表空间(一)

海量数据迁移之使用分区并行切分导入

海量数据迁移之分区并行切分

海量数据迁移之数据抽取流程

PostgreSQL · 实现分析 · PostgreSQL 10.0 并行查询和外部表的结合