Explanation of UTF-8 and Unicode

What is Unicode?

  A mapping between characters and numeric indexes (code points). A code point is written as U+xxxx, where xxxx is a hexadecimal number.
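As a small illustration of this mapping, the sketch below prints the index (code point) of a few characters in U+xxxx notation. The class name CodePointDemo and the helper toUPlus are made up for this example.

```java
public class CodePointDemo {
    // Format a code point in U+XXXX notation (at least 4 hex digits).
    public static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Escapes are used so the result does not depend on the source file encoding.
        String text = "A\u00A9\u2260";
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            System.out.println(toUPlus(cp)); // prints U+0041, U+00A9, U+2260 in turn
            i += Character.charCount(cp);
        }
    }
}
```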

Confused about Unicode and UTF-8?    Unicode is a standard character set; UTF-8 is one way of encoding it as bytes, alongside others such as UCS-2 and UCS-4, and it has become the standard encoding. Note one thing, though: for English (ASCII) characters the code point and its UTF-8 encoding have the same value, that is:

U-00000000 - U-0000007F:  0xxxxxxx

    Programmers in particular often find U-00000000 - U-0000007F enough for their daily use (the 26 English letters, digits and some symbols), so for them there is no visible difference between the character set standard (Unicode) and the encoding standard (UTF-8). When they talk to you about it, you may get confused.

Why UTF-8?    You may ask why not use UCS-4 or UCS-2. Do people just like 8 more (in Cantonese, 8 sounds like "getting rich")?        The answer is no. Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings in these encodings can contain, as parts of many wide characters, bytes like '\0' or '/', which have a special meaning in filenames and other C library function parameters.

(An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.)
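The problem can be seen in a minimal sketch like the one below. It assumes Java's UTF-16BE charset as a stand-in for UCS-2 (the two agree for BMP characters), and the class name NulByteDemo and the sample character U+2F41 are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class NulByteDemo {
    // Return true if the byte array contains the given byte value (0..255).
    public static boolean contains(byte[] bytes, int value) {
        for (byte b : bytes) {
            if ((b & 0xFF) == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "A" followed by U+2F41, picked because its 16-bit code contains the byte 0x2F ('/').
        String text = "A\u2F41";
        byte[] wide = text.getBytes(StandardCharsets.UTF_16BE); // 00 41 2F 41
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);    // 41 E2 BD 81

        System.out.println("16-bit contains NUL : " + contains(wide, 0x00)); // true
        System.out.println("16-bit contains '/' : " + contains(wide, 0x2F)); // true
        System.out.println("UTF-8 contains NUL  : " + contains(utf8, 0x00)); // false
        System.out.println("UTF-8 contains '/'  : " + contains(utf8, 0x2F)); // false
    }
}
```

A C function that treats '\0' as the end of a string would cut the 16-bit version off at its very first byte, while the UTF-8 version passes through untouched.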

    In addition, the majority of UNIX tools expect ASCII files and can't read 16-bit words as characters without major modifications. (In UTF-8, U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).

This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.)

------------Prove that ASCII and UTF-8 are the same---------
package unicode;

public class CharTest {
    public static void main(String[] args) throws Exception {
        char[] chars = new char[]{'\u007F'};
        String str = new String(chars);
        System.out.println("within 0000 - 007F : " + str);
        // For characters below U+0080 it makes no difference whether we
        // encode with ISO-8859-1 or UTF-8; the two are compatible.
        // Each label reads "encoded as -> decoded as".
        System.out.println("   UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println("   ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes("ISO-8859-1"), "UTF-8"));

        chars = new char[]{'\u00F2'};
        str = new String(chars);
        // The principle above does not apply to characters larger than 007F,
        // so both round trips below print garbled text.
        System.out.println("out of 0000 - 007F : " + str);
        System.out.println("   UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println("   ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
    }
}
---------------------------------------------------------------------------------

How long is a UTF-8 encoding?    Theoretically it can be up to 6 bytes, but in practice 3 bytes are enough for us, since no BMP character needs more than 3. (The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP).) Note that the current definition of UTF-8 (RFC 3629) stops at U+10FFFF, so a valid sequence today is never longer than 4 bytes.
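The lengths are easy to check with a short sketch. The class name Utf8LengthDemo and the helper utf8Length are hypothetical, and the sample code points are picked only to hit each length.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Number of bytes Java's built-in UTF-8 encoder produces for one code point.
    public static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII, Latin-1 range, two BMP characters, and one character outside the BMP.
        int[] samples = {0x41, 0xA9, 0x2260, 0xFFFD, 0x1F600};
        for (int cp : samples) {
            // Lengths printed: 1, 2, 3, 3, 4
            System.out.println(String.format("U+%04X -> %d byte(s)", cp, utf8Length(cp)));
        }
    }
}
```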

Important UTF-8 features:

  1. UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

  2. All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

  3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes. (Author's note: this needs further investigation; I can't fully explain it yet.)

  4. All possible 2^31 UCS codes can be encoded.

  5. UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.

  6. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

------------Prove the features (1, 2, 3)-----------------
package unicode;

public class UTF8Features {
    public static void main(String[] args) throws Exception {
        // Why not write non-ASCII characters directly in the source?
        // Because the result would then depend on your system's encoding
        // rather than on UTF-8 as you might imagine.
        char[] chars = new char[]{'\u007F'};
        String str = new String(chars);
        System.out.println("Point 1 : " + str);
        // Each label reads "encoded as -> decoded as".
        System.out.println("   UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println("   ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
        System.out.println();

        chars = new char[]{'\uE840'};
        str = new String(chars);
        System.out.println("Point 2 : " + str);
        // Just a sample; you can use this method to verify more characters.
        System.out.println("   No less than 7F      " + getHexString(str));

        chars = new char[]{'\u2260'};
        str = new String(chars);
        // Just a sample; you can use this method to verify more characters.
        System.out.println("Point 3 : " + str);
        System.out.println("   Range of 1st Byte      " + getHexString(str));
    }

    public static String getHexString(String num) throws Exception {
        StringBuilder sb = new StringBuilder();
        // You must specify UTF-8 here, otherwise the default encoding is used,
        // which depends on your environment.
        byte[] bytes = num.getBytes("UTF-8");
        for (int i = 0; i < bytes.length; i++) {
            // Mask with 0xFF to read the signed Java byte as an unsigned value.
            sb.append(Integer.toHexString(bytes[i] & 0xFF).toUpperCase() + " ");
        }
        return sb.toString();
    }
}
---------------------------------------------------------------------------------

Principle of representing a Unicode character in UTF-8:

U-00000000 - U-0000007F:  0xxxxxxx
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
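The first four rows of the table can be sketched in code as follows. This is only an illustration under a few assumptions: the names Utf8Encoder, encode and hex are made up here, the sketch stops at U+10FFFF (the limit of current UTF-8), and it does not reject surrogate code points (U+D800 to U+DFFF), which a production encoder should.

```java
public class Utf8Encoder {
    // Encode one code point by filling the x bits of the matching table row.
    public static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range: " + cp);
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[]{(byte) cp};
        }
        if (cp <= 0x7FF) {           // 110xxxxx 10xxxxxx
            return new byte[]{
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {          // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[]{
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[]{           // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    // Render bytes as upper-case hex pairs separated by single spaces.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x41)));    // 41
        System.out.println(hex(encode(0xA9)));    // C2 A9
        System.out.println(hex(encode(0x2260)));  // E2 89 A0
        System.out.println(hex(encode(0x1F600))); // F0 9F 98 80
    }
}
```

Each branch shifts the code point right to pick out the bits for one byte, masks them to 6 bits with 0x3F where needed, and ORs in the fixed prefix from the table.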

How to use the principle above?

Sample: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

    11000010 10101001 = 0xC2 0xA9

Explanation:

A: 1010

9: 1001

Row 2 of the table applies, because 00000080 <= 00A9 <= 000007FF.

Work from the low byte to the high byte:

1. The low byte has 6 x's, so we cut the last 6 bits from 10101001 (A9), which gives 101001.

2. The high byte has 5 x's, so we take the remaining 2 bits of A9, which are 10, and pad them on the left with three 0s to make 5 bits: 00010.

Complete the low byte with the prefix 10: (10) combined with (101001) -> 10101001

Complete the high byte with the prefix 110: (110) combined with (00010) -> 11000010

The result is

11000010 10101001 = 0xC2 0xA9

You can also verify the following character against row 3 of the table in the same way:

U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx 

character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

    11100010 10001001 10100000 = 0xE2 0x89 0xA0
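Going the other way, a decoder only has to strip the prefixes and glue the x bits back together. The sketch below is a simplified illustration, not a full validator: the names Utf8Decoder, decodeAt and startOfSequence are hypothetical, only 1 to 4 byte sequences are handled, and overlong forms are not rejected. The second helper shows the resynchronization property: since continuation bytes always look like 10xxxxxx, the start of a character can be found from any position by stepping back over them.

```java
public class Utf8Decoder {
    // Decode the single UTF-8 sequence that starts at bytes[offset].
    public static int decodeAt(byte[] bytes, int offset) {
        int first = bytes[offset] & 0xFF;
        int length;
        int cp;
        if (first < 0x80) {                   // 0xxxxxxx
            return first;
        } else if ((first & 0xE0) == 0xC0) {  // 110xxxxx
            length = 2;
            cp = first & 0x1F;
        } else if ((first & 0xF0) == 0xE0) {  // 1110xxxx
            length = 3;
            cp = first & 0x0F;
        } else if ((first & 0xF8) == 0xF0) {  // 11110xxx
            length = 4;
            cp = first & 0x07;
        } else {
            throw new IllegalArgumentException("not a valid first byte");
        }
        for (int i = 1; i < length; i++) {
            int next = bytes[offset + i] & 0xFF;
            if ((next & 0xC0) != 0x80) {      // must be 10xxxxxx
                throw new IllegalArgumentException("expected a continuation byte");
            }
            cp = (cp << 6) | (next & 0x3F);   // append the 6 payload bits
        }
        return cp;
    }

    // Step back from any index to the first byte of the character it belongs to.
    public static int startOfSequence(byte[] bytes, int index) {
        while (index > 0 && (bytes[index] & 0xC0) == 0x80) {
            index--;
        }
        return index;
    }

    public static void main(String[] args) {
        byte[] data = {0x41, (byte) 0xE2, (byte) 0x89, (byte) 0xA0};
        System.out.println(String.format("U+%04X", decodeAt(data, 1))); // U+2260
        System.out.println(startOfSequence(data, 3));                   // 1
    }
}
```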

Reference:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#unicode

Date: 2024-10-31 19:49:13
