PostgreSQL Oracle 兼容性之 - COMPOSE , UNISTR , DECOMPOSE

背景

参考
http://www.th7.cn/db/Oracle/2011-06-30/8490.shtml

很多语言，包括英语在内，都使用沉音字符（accented character）。

因为这些字符不属于 ASCII 字符集，所以假如不查看 Unicode 值也不使用 Unicode 编辑器并将其转成一个已知字符集，就很难编写使用这些字符的代码。

Oracle9i 引入了 COMPOSE 函数，该函数接受一串 Unicode 字符并规则化其文本。

这就意味着它可以接受一个字母和一个组合标记，比如说‘a'（Unicode 字符0097）和沉音符（Unicode 字符0300），然后创建一个单独的由两个标记组合而成的字符。

COMPOSE 使用非凡的组合标记，而没有使用 ASCII 中相应的音节标记，它所使用的非凡的组合标记是 Unicode 标准的一部分。上面的例子的结果应该是 Unicode 字符00E0（有一个沉音符的小写拉丁字母‘a'）。

在 ANSI 中最常见的组合字符有： U+0300：沉音符（grave accent）( ` )。 U+0301：重音符（acute accent）( ' )。 U+0302：抑扬音符号（circumflex accent）(^)。 U+0303：颚化符号（tilde）(~)。 U+0308：元音变音 ?。

假如没有非凡的软件或者键盘驱动程序的话，很难在键盘上输入 Unicode 字符0097和0300。因此，以纯 ASCII 文本输入 Unicode 序列的一个方法是使用 UNISTR 函数。

这个函数接受一个 ASCII 字符串然后以国家字符集（通常作为16位 Unicode 或者 UTF-8 字符集安装）创建一个 Unicode 字符的序列。

它使用十六进制占位符序列映射任何非 ASCII 字符，映射方式与 Java 类似。

要输入a后接一个沉音符组合字符的序列，可以使用 UNISTR(‘a300')，而不要试图直接在代码中输入字符。

这个函数在任何字符集以及任何具有基于 Unicode 的国家字符集的数据库下都可以正常运行。

可以将多个组合字符放在函数中――可以在 UNISTR 函数中混合使用 ASCII 和 Unicode 占位符。

例如，可以像下面这样使用 UNISTR 函数：

select COMPOSE(UNISTR('Unless you are nai308ve, meet me at the cafe301 with your re301sume301.')) from dual;

在将 UNISTR 函数的输出与 COMPOSE 组合时，可以在不查找任何值的情况下生成一个 Unicode 字符。
例如：

select 'it is true' if compose(unistr('a300')) = unistr('0e0');

UNISTR用法
输入编码得到unicode编码的字符
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions204.htm
UNISTR(string)

UNISTR takes as its argument a text literal or an expression that resolves to character data and returns it in the national character set.
The national character set of the database can be either AL16UTF16 or UTF8.
UNISTR provides support for Unicode string literals by letting you specify the Unicode encoding value of characters in the string.
This is useful, for example, for inserting data into NCHAR columns.

The Unicode encoding value has the form '\xxxx' where 'xxxx' is the hexadecimal value of a character in UCS-2 encoding format.
Supplementary characters are encoded as two code units, the first from the high-surrogates range (U+D800 to U+DBFF), and the second from the low-surrogates range (U+DC00 to U+DFFF).
To include the backslash in the string itself, precede it with another backslash (\\).

For portability and data preservation, Oracle recommends that in the UNISTR string argument you specify only ASCII characters and the Unicode encoding values.

SELECT UNISTR('abc\00e5\00f1\00f6') FROM DUAL;

UNISTR
------
abcåñö

COMPOSE用法
将两个unicode编码的字符合成，例如字母与沉音符合成为另一个UNICODE字符
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions025.htm
COMPOSE(char)

COMPOSE takes as its argument a string, or an expression that resolves to a string, in any datatype, and returns a Unicode string in its fully normalized form in the same character set as the input.
char can be any of the datatypes CHAR, VARCHAR2, NCHAR, NVARCHAR2, CLOB, or NCLOB. For example, an o code point qualified by an umlaut code point will be returned as the o-umlaut code point.

CLOB and NCLOB values are supported through implicit conversion. If char is a character LOB value, it is converted to a VARCHAR value before the COMPOSE operation.
The operation will fail if the size of the LOB value exceeds the supported length of the VARCHAR in the particular development environment.

SELECT COMPOSE ( 'o' || UNISTR('\0308') ) FROM DUAL; 

CO
--
ö

DECOMPOSE用法
将带有合成字符的字符串，解析成合成前的UNICODE字符串
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions041.htm
DECOMPOSE(string)

DECOMPOSE is valid only for Unicode characters. DECOMPOSE takes as its argument a string in any datatype and returns a Unicode string after decomposition in the same character set as the input.
For example, an o-umlaut code point will be returned as the "o" code point followed by an umlaut code point.

SELECT DECOMPOSE ('Châteaux') FROM DUAL; 

DECOMPOSE
---------
Cha^teaux