Differences Between the Various Encodings in C#

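    I recently looked into how the different Encoding classes in C# differ from one another and would like to share what I found. If anything here is wrong, corrections are welcome.

    Put simply, why do we need encoding? Suppose a computer needs to represent letters such as 'a' and 'b'. How are these letters represented in memory? As everyone knows, data in memory is stored in binary, so the letters, digits, and symbols we want to represent have to be converted into a binary form the computer can store. That is the purpose of encoding.

    Encoding a character into its binary representation in memory takes two steps. First, the character set itself is coded: each character is assigned a fixed code. Then that code is converted into a binary representation in memory.

The character codes commonly used on computers fall into two main kinds: ASCII and Unicode.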

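1. ASCII

    ASCII (American Standard Code for Information Interchange) is a computer character encoding based on the Latin alphabet. It is a standard single-byte encoding scheme for text data. Standard ASCII uses 7 bits to represent 128 characters; the various 8-bit extensions of ASCII use the extra bit to represent 256 characters.

    The biggest drawback of ASCII is that it can only represent the letters, digits, and symbols commonly used in American English. It cannot represent characters from other languages, such as Chinese characters.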

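2. Unicode

    Unicode is a coding scheme designed to accommodate the characters and symbols of all the world's writing systems. It is called the "unified code" and meets cross-language and cross-platform needs. Unicode was developed on the basis of the Universal Character Set (UCS) standard. Because it can hold so many characters and symbols, it is far more widely used than plain ASCII, whose 128 characters are themselves the first 128 Unicode code points.

    Both schemes above describe how common characters are coded: each character is assigned a fixed code point (a number) that later processing can rely on. For example, the Chinese character "字" has the Unicode code point 23383 (U+5B57).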
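The code point mentioned above can be checked directly. This is a minimal sketch written with top-level statements (C# 9 or later); the variable names are only illustrative.

```csharp
using System;

// The example character from the text above.
string text = "字";

// char.ConvertToUtf32 returns the Unicode code point at the given index.
int codePoint = char.ConvertToUtf32(text, 0);

Console.WriteLine(codePoint);             // 23383
Console.WriteLine($"U+{codePoint:X4}");   // U+5B57
```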

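    With these two character codes in place, the codes can be turned into a binary form that can be used in memory.

    1. Encoding ASCII is simple. ASCII codes have a maximum value of 127, so each one fits directly into a single byte in memory and no special processing is needed.

    2. Encoding Unicode is more complicated, because Unicode has to represent the letters and symbols of all languages.

    The encodings available for Unicode are introduced below.

    .NET provides the following five encoding classes (ASCIIEncoding covers only the ASCII subset of Unicode):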

    ASCIIEncoding

    UTF7Encoding

    UTF8Encoding

    UnicodeEncoding

    UTF32Encoding
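As a rough illustration of how these classes differ, the sketch below encodes the same two-character string with four of them and compares the byte counts. UTF7Encoding is left out because it is marked obsolete in .NET 5 and later. The sample string and variable names are assumptions made for the example.

```csharp
using System;
using System.Text;

// One ASCII letter followed by one Chinese character (U+5B57).
string sample = "a\u5B57";

// GetByteCount reports the encoded size without any byte order mark.
int asciiBytes = new ASCIIEncoding().GetByteCount(sample);    // 2: the non-ASCII character becomes '?'
int utf8Bytes = new UTF8Encoding().GetByteCount(sample);      // 4: 1 byte + 3 bytes
int utf16Bytes = new UnicodeEncoding().GetByteCount(sample);  // 4: 2 bytes + 2 bytes
int utf32Bytes = new UTF32Encoding().GetByteCount(sample);    // 8: 4 bytes + 4 bytes

Console.WriteLine($"ASCII: {asciiBytes}, UTF-8: {utf8Bytes}, UTF-16: {utf16Bytes}, UTF-32: {utf32Bytes}");
```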

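The following first explains what an Encoding is, and then covers the advantages, disadvantages, and differences of each of these encodings.

Understanding Encoding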

    Internally, the .NET Framework stores text as Unicode UTF-16. An encoder transforms this text data to a sequence of bytes. A decoder transforms a sequence of bytes into this internal format. An encoding describes the rules by which an encoder or decoder operates. For example, the UTF8Encoding class describes the rules for encoding to and decoding from a sequence of bytes representing text as UTF-8. Encoding and decoding can also include certain validation steps. For example, the UnicodeEncoding class checks all surrogates to make sure they constitute valid surrogate pairs. Both of these classes inherit from the Encoding class.

The key sentence is: An encoding describes the rules by which an encoder or decoder operates.

    A UTF is a method of encoding Unicode code points into a binary representation in memory. The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a way to encode that code point.

Selecting an Encoding Class

    When you have the opportunity to choose an encoding, you are strongly recommended to use a Unicode encoding, typically either UTF8Encoding or UnicodeEncoding (UTF32Encoding is also supported). In particular, UTF8Encoding is preferred over ASCIIEncoding. If the content is ASCII, the two encodings are identical, but UTF8Encoding can also represent every Unicode character, while ASCIIEncoding supports only the Unicode character values between U+0000 and U+007F. Because ASCIIEncoding does not provide error detection, UTF8Encoding is also better for security.

    UTF8Encoding has been tuned to be as fast as possible and should be faster than any other encoding. Even for content that is entirely ASCII, operations performed with UTF8Encoding are faster than operations performed with ASCIIEncoding. You should consider using ASCIIEncoding only for certain legacy applications. However, even in this case, UTF8Encoding might still be a better choice. Assuming default settings, the following scenarios can occur:

    If your application has content that is not strictly ASCII and encodes it with ASCIIEncoding, each non-ASCII character encodes as a question mark ("?"). If the application then decodes this data, the information is lost.

    If the application has content that is not strictly ASCII and encodes it with UTF8Encoding, the result seems unintelligible if interpreted as ASCII. However, if the application then decodes this data, the data performs a round trip successfully.
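The two scenarios above can be reproduced with a short sketch. It assumes the default fallback settings of Encoding.ASCII and Encoding.UTF8; the sample string is made up for the example.

```csharp
using System;
using System.Text;

// "naïve 字" written with escapes so the source file encoding does not matter.
string original = "na\u00EFve \u5B57";

// ASCII: every non-ASCII character is replaced by '?' while encoding.
byte[] asciiBytes = Encoding.ASCII.GetBytes(original);
string viaAscii = Encoding.ASCII.GetString(asciiBytes);

// UTF-8: every character is preserved, so decoding restores the original.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
string viaUtf8 = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine(viaAscii);             // na?ve ?
Console.WriteLine(viaUtf8 == original);  // True
```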

1. ASCIIEncoding

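    ASCIIEncoding encodes each Unicode character it supports as a single byte.

    ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. ASCIIEncoding does not provide error detection; if your application needs error detection, UTF8Encoding, UnicodeEncoding, or UTF32Encoding is recommended instead.

    UTF8Encoding, UnicodeEncoding, and UTF32Encoding are better suited to building globalized applications.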

When selecting the ASCII encoding for your applications, consider the following:

    The ASCII encoding is usually appropriate for protocols that require ASCII.

    If your application requires 8-bit encoding, the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use
of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable. Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit.

    Previous versions of .NET Framework allowed spoofing by merely ignoring the 8th bit. The current version has been changed so that non-ASCII code points fall back
during the decoding of bytes.
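The fallback behaviour described above can be observed by decoding a byte whose 8th bit is set. This is a small sketch assuming the default replacement fallback of Encoding.ASCII; the byte values are chosen for illustration.

```csharp
using System;
using System.Text;

// 0x41 is 'A'. 0xC1 is 0x41 with the 8th bit set, so it is not valid ASCII.
byte[] input = { 0x41, 0xC1 };

// The decoder does not strip the 8th bit (which would yield "AA");
// the invalid byte falls back to the replacement character '?'.
string decoded = Encoding.ASCII.GetString(input);

Console.WriteLine(decoded);  // A?
```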

2. UTF7Encoding

    Represents a UTF-7 encoding of Unicode characters.

    The UTF-7 encoding represents Unicode characters as sequences of 7-bit ASCII characters. This encoding supports certain protocols for which it is required, most often
e-mail or newsgroup protocols. Since UTF-7 is not particularly secure or robust, and most modern systems allow 8-bit encodings, UTF-8 should normally be preferred to UTF-7.

    UTF7Encoding does not provide error detection. For security reasons, the application should use UTF8Encoding, UnicodeEncoding, or UTF32Encoding and enable error detection.

    Using UTF7Encoding is not recommended.
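One way to enable error detection, sketched here for UTF8Encoding, is the constructor overload that takes a throwOnInvalidBytes flag. The invalid byte sequence below is made up for the example.

```csharp
using System;
using System.Text;

// 0xFF can never appear in well-formed UTF-8.
byte[] invalid = { 0x41, 0xFF, 0x42 };

// Without error detection the invalid byte is silently replaced with U+FFFD.
string lenient = new UTF8Encoding(false, false).GetString(invalid);

// With error detection (second argument true) decoding throws instead.
bool threw = false;
try
{
    new UTF8Encoding(false, true).GetString(invalid);
}
catch (DecoderFallbackException)
{
    threw = true;
}

Console.WriteLine(lenient == "A\uFFFDB");  // True
Console.WriteLine(threw);                  // True
```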

3. UTF8Encoding

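    UTF-8 encoding represents each code point as a sequence of one to four bytes. UTF8Encoding is the class that implements this encoding.

UTF-8 encodes Unicode byte by byte: characters in different ranges use encodings of different lengths, and the maximum length of a UTF-8 encoding is four bytes.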
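The variable length can be seen by encoding one character from each range. This is an illustrative sketch; the four sample characters are assumptions chosen to hit the 1-, 2-, 3-, and 4-byte cases.

```csharp
using System;
using System.Text;

// 'A' (U+0041), 'é' (U+00E9), '字' (U+5B57), and an emoji (U+1F600).
string[] samples = { "A", "\u00E9", "\u5B57", "\U0001F600" };
int[] byteCounts = new int[samples.Length];

for (int i = 0; i < samples.Length; i++)
{
    int codePoint = char.ConvertToUtf32(samples[i], 0);
    byteCounts[i] = Encoding.UTF8.GetByteCount(samples[i]);
    Console.WriteLine($"U+{codePoint:X4}: {byteCounts[i]} byte(s)");
}
// Byte counts are 1, 2, 3, and 4 respectively.
```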

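    According to the .NET documentation, UTF8Encoding is faster than all the other encodings; even when the content is entirely ASCII, encoding with it is faster than with ASCIIEncoding.

Since UTF8Encoding can also represent every Unicode character and supports error detection, it is recommended over ASCIIEncoding.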

4. UnicodeEncoding

    UnicodeEncoding uses 16-bit unsigned integers as its code unit and encodes each code point as one or two 16-bit integers.

    The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a way to encode that code point. The Unicode Standard uses the following UTFs:

    UTF-8, which represents each code point as a sequence of one to four bytes.

    UTF-16, which represents each code point as a sequence of one to two 16-bit integers.

    UTF-32, which represents each code point as a 32-bit integer.
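To make the three formats concrete, the sketch below prints the bytes that each one produces for the same code point, U+5B57. It assumes the little-endian Encoding.Unicode and Encoding.UTF32 instances; GetBytes does not emit a byte order mark.

```csharp
using System;
using System.Text;

// The character '字' (U+5B57).
string text = "\u5B57";

string utf8Hex = BitConverter.ToString(Encoding.UTF8.GetBytes(text));      // E5-AD-97
string utf16Hex = BitConverter.ToString(Encoding.Unicode.GetBytes(text));  // 57-5B
string utf32Hex = BitConverter.ToString(Encoding.UTF32.GetBytes(text));    // 57-5B-00-00

Console.WriteLine($"UTF-8:  {utf8Hex}");
Console.WriteLine($"UTF-16: {utf16Hex}");
Console.WriteLine($"UTF-32: {utf32Hex}");
```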

    UnicodeEncoding is not byte-compatible with ASCII. UTF-16, the encoding that UnicodeEncoding implements, is also the format in which .NET stores strings internally.
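Both properties of UTF-16 can be illustrated briefly: a character outside the Basic Multilingual Plane takes two 16-bit units (a surrogate pair), and even an ASCII letter takes two bytes. This is a sketch using the little-endian Encoding.Unicode; the sample characters are assumptions.

```csharp
using System;
using System.Text;

// U+1F600 lies outside the Basic Multilingual Plane.
string emoji = "\U0001F600";

int charCount = emoji.Length;                               // 2 UTF-16 code units
bool isPair = char.IsSurrogatePair(emoji, 0);               // true
int utf16ByteCount = Encoding.Unicode.GetByteCount(emoji);  // 4 bytes

// An ASCII letter needs two bytes in UTF-16, so the output is not valid ASCII text.
string letterHex = BitConverter.ToString(Encoding.Unicode.GetBytes("A"));  // 41-00

Console.WriteLine($"{charCount} units, surrogate pair: {isPair}, {utf16ByteCount} bytes");
Console.WriteLine(letterHex);
```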

5. UTF32Encoding

    UTF32Encoding uses 32-bit unsigned integers as its code unit and encodes each code point as a single 32-bit integer.
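Because every code point occupies exactly four bytes, the UTF-32 byte count divided by four gives the number of code points, which can differ from the string's Length. A minimal sketch, with a sample string assumed for the example:

```csharp
using System;
using System.Text;

// Three code points: 'A', '字' (U+5B57), and an emoji (U+1F600).
string text = "A\u5B57\U0001F600";

int utf16Units = text.Length;                        // 4: the emoji needs a surrogate pair
int utf32Bytes = Encoding.UTF32.GetByteCount(text);  // 12: 3 code points * 4 bytes
int codePoints = utf32Bytes / 4;                     // 3

Console.WriteLine($"{utf16Units} UTF-16 units, {utf32Bytes} UTF-32 bytes, {codePoints} code points");
```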

Time: 2024-09-21 05:24:53
