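I recently looked into how the different encodings exposed by C#'s Encoding class differ, and I'd like to share my notes. If anything here is wrong, corrections are welcome.
Put simply, why do we need encoding at all? Suppose a computer has to represent letters such as 'a' and 'b'. How are those letters represented in memory? Data in memory is stored as binary, so every letter, digit, and symbol has to be converted into a binary form the computer can store. That conversion is the purpose of encoding.
Encoding a character into binary takes two steps. First the character set itself is coded: each character is assigned a fixed number. Then that number is converted into a binary representation in memory.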
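Two character codes are in common use: ASCII and Unicode.
1. ASCII
ASCII (American Standard Code for Information Interchange) is a computer character code based on the Latin alphabet. It is a single-byte scheme for text data. Standard ASCII uses 7 bits and defines 128 characters; the 8-bit variants often called "extended ASCII" use all 8 bits to define 256 characters.
The main drawback of ASCII is that it only covers the letters, digits, and symbols used in American English. It cannot represent characters from other languages, such as Chinese characters.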
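2. Unicode
Unicode (the "unified code") is a coding scheme designed to hold the characters and symbols of all the world's writing systems, which makes it suitable for cross-language and cross-platform use. It was developed alongside the Universal Character Set (UCS) standard. Because it can represent all of these characters, it is far more widely used today, and plain ASCII is rarely chosen on its own (its 128 characters live on as the first 128 Unicode code points).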
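Both schemes define how common characters are coded: each character is assigned a fixed code point (a number), and everything else builds on that. For example, the Chinese character "字" has the Unicode code point 23383 (U+5B57).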
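As a quick check of the code point mentioned above, here is a minimal sketch that reads the number back from a C# string. The variable names are only illustrative, and the snippet assumes top-level statements (C# 9 or later).

```csharp
// Top-level statements (C# 9 or later).
using System;

string text = "字";

// A char is one UTF-16 code unit. For characters in the Basic Multilingual
// Plane, its numeric value is the Unicode code point itself.
int fromChar = text[0];

// char.ConvertToUtf32 also works for characters that need a surrogate pair.
int codePoint = char.ConvertToUtf32(text, 0);

Console.WriteLine(codePoint);           // 23383
Console.WriteLine($"U+{codePoint:X4}"); // U+5B57
```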
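With code points assigned, the next step is to turn them into a binary form that can be stored in memory.
1. Encoding ASCII is simple. Every ASCII code fits in one byte (the largest value is 127, or 255 for the 8-bit extended sets), so it can be stored directly as a single byte with no further processing.
2. Encoding Unicode is more complicated. Unicode has to cover the letters and symbols of every language, so the mapping to bytes is not as simple.
The rest of this post covers how Unicode is encoded.
In .NET, the encodings are exposed through the following five classes: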
ASCIIEncoding
UTF7Encoding
UTF8Encoding
UnicodeEncoding
UTF32Encoding
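The next section explains what an Encoding is. After that, each of these encodings is described in turn, with its strengths, weaknesses, and differences.
Understanding Encoding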
Internally, the .NET Framework stores text as Unicode UTF-16. An encoder transforms this text data to a sequence of bytes. A decoder transforms a sequence of bytes into this internal format. An encoding describes the rules by which an encoder or decoder operates. For example, the UTF8Encoding class describes the rules for encoding to and decoding from a sequence of bytes representing text as UTF-8. Encoding and decoding can also include certain validation steps. For example, the UnicodeEncoding class checks all surrogates to make sure they constitute valid surrogate pairs. Both of these classes inherit from the Encoding class.
The key sentence is: An encoding describes the rules by which an encoder or decoder operates.
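To make the encoder, decoder, and validation roles concrete, here is a small sketch. It uses UTF8Encoding with throwOnInvalidBytes enabled, rather than the UnicodeEncoding surrogate check the quoted text mentions, and the truncated byte sequence is an example chosen for illustration.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text;

// Strict UTF-8: no byte order mark, and invalid input throws instead of
// being silently replaced.
var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                            throwOnInvalidBytes: true);

// Encoder direction: UTF-16 text in memory -> bytes.
byte[] bytes = utf8.GetBytes("字");
Console.WriteLine(BitConverter.ToString(bytes)); // E5-AD-97

// Decoder direction: bytes -> UTF-16 text.
string decoded = utf8.GetString(bytes);
Console.WriteLine(decoded == "字"); // True

// Validation: a three-byte sequence cut off after two bytes is rejected.
bool rejected = false;
try
{
    utf8.GetString(new byte[] { 0xE5, 0xAD });
}
catch (DecoderFallbackException)
{
    rejected = true;
}
Console.WriteLine(rejected); // True
```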
A UTF is a method of encoding a Unicode code point into a binary representation in memory. The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a way to encode that code point.
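The following sketch encodes the same code point with three UTFs so the difference is visible byte by byte. It relies on the static Encoding.UTF8, Encoding.Unicode (UTF-16, little-endian), and Encoding.UTF32 (little-endian) properties; the variable names are illustrative.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text;

string text = "字"; // code point U+5B57

// GetBytes does not prepend a byte order mark.
byte[] utf8 = Encoding.UTF8.GetBytes(text);
byte[] utf16 = Encoding.Unicode.GetBytes(text);
byte[] utf32 = Encoding.UTF32.GetBytes(text);

Console.WriteLine(BitConverter.ToString(utf8));  // E5-AD-97
Console.WriteLine(BitConverter.ToString(utf16)); // 57-5B
Console.WriteLine(BitConverter.ToString(utf32)); // 57-5B-00-00
```

One code point, three different byte sequences: the code point is fixed by Unicode, and the UTF decides how it is laid out in bytes.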
Selecting an Encoding Class
When you have the opportunity to choose an encoding, you are strongly recommended to use a Unicode encoding, typically either UTF8Encoding or UnicodeEncoding (UTF32Encoding is also supported). In particular, UTF8Encoding is preferred over ASCIIEncoding. If the content is ASCII, the two encodings are identical, but UTF8Encoding can also represent every Unicode character, while ASCIIEncoding supports only the Unicode character values between U+0000 and U+007F. Because ASCIIEncoding does not provide error detection, UTF8Encoding is also better for security.
UTF8Encoding has been tuned to be as fast as possible and should be faster than any other encoding. Even for content that is entirely ASCII, operations performed with UTF8Encoding are faster than operations performed with ASCIIEncoding. You should consider using ASCIIEncoding only for certain legacy applications. However, even in this case, UTF8Encoding might still be a better choice. Assuming default settings, the following scenarios can occur:
If your application has content that is not strictly ASCII and encodes it with ASCIIEncoding, each non-ASCII character encodes as a question mark ("?"). If the application then decodes this data, the information is lost.
If the application has content that is not strictly ASCII and encodes it with UTF8Encoding, the result seems unintelligible if interpreted as ASCII. However, if the application then decodes this data, the data performs a round trip successfully.
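The two scenarios above can be reproduced with a short sketch. The sample string is an assumption chosen to mix ASCII and non-ASCII content, and both encodings are used with their default settings.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text;

string original = "abc字";

// Scenario 1: ASCIIEncoding replaces the non-ASCII character with '?',
// so decoding cannot recover it.
byte[] asciiBytes = Encoding.ASCII.GetBytes(original);
string viaAscii = Encoding.ASCII.GetString(asciiBytes);
Console.WriteLine(viaAscii); // abc?

// Scenario 2: UTF8Encoding round-trips the same text.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
string viaUtf8 = Encoding.UTF8.GetString(utf8Bytes);
Console.WriteLine(viaUtf8 == original); // True

// Reading the UTF-8 bytes as ASCII keeps "abc" but garbles the rest.
string misread = Encoding.ASCII.GetString(utf8Bytes);
Console.WriteLine(misread);
```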
1. ASCIIEncoding
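ASCIIEncoding encodes each Unicode character as a single byte.
ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. ASCIIEncoding does not provide error detection; if you need it, your application should use UTF8Encoding, UnicodeEncoding, or UTF32Encoding instead.
UTF8Encoding, UnicodeEncoding, and UTF32Encoding are also better suited to applications built for a global audience.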
When selecting the ASCII encoding for your applications, consider the following:
The ASCII encoding is usually appropriate for protocols that require ASCII.
If your application requires 8-bit encoding, the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable. Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit.
Previous versions of .NET Framework allowed spoofing by merely ignoring the 8th bit. The current version has been changed so that non-ASCII code points fall back during the decoding of bytes.
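Here is a hedged sketch of the 8th-bit issue described above. The byte values are made up for illustration: 0xE1 would turn into the letter 'a' (0x61) if a decoder simply dropped the top bit. The second half assumes a protocol that must stay ASCII and would rather fail than guess, using Encoding.GetEncoding with exception fallbacks.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text;

// 'h', then 0xE1 (8th bit set), then 'x'.
byte[] suspicious = { 0x68, 0xE1, 0x78 };

// Default ASCII decoding falls back for the non-ASCII byte instead of
// masking the 8th bit, so the result is not "hax".
string decoded = Encoding.ASCII.GetString(suspicious);
bool spoofed = decoded == "hax";
Console.WriteLine(spoofed); // False

// A strict ASCII encoding that throws on anything outside 0x00-0x7F.
Encoding strictAscii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());

bool rejected = false;
try
{
    strictAscii.GetString(suspicious);
}
catch (DecoderFallbackException)
{
    rejected = true;
}
Console.WriteLine(rejected); // True
```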
2. UTF7Encoding
Represents a UTF-7 encoding of Unicode characters.
The UTF-7 encoding represents Unicode characters as sequences of 7-bit ASCII characters. This encoding supports certain protocols for which it is required, most often e-mail or newsgroup protocols. Since UTF-7 is not particularly secure or robust, and most modern systems allow 8-bit encodings, UTF-8 should normally be preferred to UTF-7.
UTF7Encoding does not provide error detection. For security reasons, the application should use UTF8Encoding, UnicodeEncoding, or UTF32Encoding and enable error detection.
In short, UTF7Encoding is not recommended (it is marked obsolete in .NET 5 and later).
3. UTF8Encoding
UTF-8 encoding represents each code point as a sequence of one to four bytes.
UTF-8 works in units of one byte: characters in different ranges take different numbers of bytes, up to a maximum of four.
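A minimal sketch of the variable length, using one sample character from each length class. The sample characters are my own choice; the non-ASCII ones are written as escapes so the result does not depend on how the source file is saved.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text;

// U+0061 'a', U+00E9 'é', U+5B57 '字', U+1F600 (an emoji outside the BMP).
string[] samples = { "a", "\u00E9", "\u5B57", "\U0001F600" };
int[] byteCounts = new int[samples.Length];

for (int i = 0; i < samples.Length; i++)
{
    int codePoint = char.ConvertToUtf32(samples[i], 0);
    byteCounts[i] = Encoding.UTF8.GetByteCount(samples[i]);
    Console.WriteLine($"U+{codePoint:X4}: {byteCounts[i]} byte(s)");
}
// U+0061: 1 byte(s)
// U+00E9: 2 byte(s)
// U+5B57: 3 byte(s)
// U+1F600: 4 byte(s)
```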
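According to the documentation quoted earlier, UTF8Encoding has been tuned to be faster than any other encoding; even when the content is entirely ASCII, it should be faster than ASCIIEncoding.
UTF8Encoding also gives better results than ASCIIEncoding because it can represent every Unicode character, so UTF8Encoding is the recommended choice over ASCIIEncoding.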
4. UnicodeEncoding
UnicodeEncoding implements UTF-16. Its code unit is a 16-bit unsigned integer, and each code point is encoded as one or two 16-bit integers.
The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a way to encode that code point. The Unicode Standard uses the following UTFs:
UTF-8, which represents each code point as a sequence of one to four bytes.
UTF-16, which represents each code point as a sequence of one to two 16-bit integers.
UTF-32, which represents each code point as a 32-bit integer.
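UnicodeEncoding is not byte-compatible with ASCII, because even an ASCII character takes two bytes. UTF-16 is also the format in which C# strings are stored internally, so UnicodeEncoding matches the in-memory representation of a string.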
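The sketch below illustrates both points: an ASCII letter takes two bytes in UTF-16, and a code point above U+FFFF takes two 16-bit units (a surrogate pair). The emoji is an example of my own choosing, written as an escape.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text;

// Encoding.Unicode is UTF-16, little-endian.
byte[] letter = Encoding.Unicode.GetBytes("a");
Console.WriteLine(BitConverter.ToString(letter)); // 61-00

// U+1F600 lies outside the Basic Multilingual Plane.
string emoji = "\U0001F600";
int codeUnits = emoji.Length;                           // number of 16-bit units
bool isPair = char.IsSurrogatePair(emoji, 0);
int emojiBytes = Encoding.Unicode.GetByteCount(emoji);

Console.WriteLine(codeUnits);  // 2
Console.WriteLine(isPair);     // True
Console.WriteLine(emojiBytes); // 4
```

Note that string.Length counts UTF-16 code units, not characters, which is why the single emoji reports a length of 2.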
5. UTF32Encoding
UTF32Encoding implements UTF-32. Its code unit is a 32-bit unsigned integer, and each code point is encoded as exactly one 32-bit integer.
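As a rough comparison, this sketch counts the bytes the same text needs under each encoding. The three-character sample string is an assumption picked to include an ASCII letter, a BMP character, and a character outside the BMP.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text;

// Three code points: U+0061, U+5B57, U+1F600.
string text = "a\u5B57\U0001F600";

int utf32Bytes = Encoding.UTF32.GetByteCount(text);   // 4 + 4 + 4
int utf16Bytes = Encoding.Unicode.GetByteCount(text); // 2 + 2 + 4
int utf8Bytes = Encoding.UTF8.GetByteCount(text);     // 1 + 3 + 4

Console.WriteLine(utf32Bytes); // 12
Console.WriteLine(utf16Bytes); // 8
Console.WriteLine(utf8Bytes);  // 8

// Even a plain ASCII letter occupies a full 32-bit unit in UTF-32.
Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes("a"))); // 61-00-00-00
```

UTF-32 is the only one of the three with a fixed width per code point, at the cost of the most space for typical text.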