Links:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
https://en.wikipedia.org/wiki/Code_page
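A character code (internal code) is the numeric value used to represent a character. Internal codes are used whenever a document is entered or stored. They fall into two classes:
- Single-byte codes: Single-Byte Character Sets (SBCS), which can encode up to 256 characters.
- Double-byte codes: Double-Byte Character Sets (DBCS), which can encode up to about 65,000 characters (2^16 = 65,536) and are used mainly for East Asian scripts with large character repertoires.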
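A codepage is a selected list of character codes arranged in a specific order. For the early single-byte languages, the ordering of the codes in the codepage lets the system look up the code that corresponds to a keyboard input value. For double-byte codes, the codepage supplies a multibyte-to-Unicode mapping table, so that characters stored as Unicode can be converted to the corresponding character codes and back. In the Linux kernel, the analogous conversion functions for UTF-8 are utf8_mbtowc and utf8_wctomb.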
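The table-lookup idea behind a single-byte codepage can be sketched in a few lines of C. This is only an illustration, not kernel code: the function names `sbcs_to_unicode` and `unicode_to_sbcs` and the table `toy_high_half` are hypothetical, and the table fills in just three entries (taken from CP437) instead of all 128 upper-half slots.

```c
#include <assert.h>
#include <stdint.h>

#define UNMAPPED 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Upper half (bytes 0x80..0xFF) of a toy single-byte codepage.
 * Only three CP437 entries are filled in; unset entries are 0,
 * which this sketch treats as "unmapped". */
const uint16_t toy_high_half[128] = {
    [0x00] = 0x00C7,  /* byte 0x80 -> U+00C7 */
    [0x01] = 0x00FC,  /* byte 0x81 -> U+00FC */
    [0x02] = 0x00E9,  /* byte 0x82 -> U+00E9 */
};

/* Byte -> Unicode: ASCII maps to itself, the rest goes through the table. */
uint16_t sbcs_to_unicode(const uint16_t high[128], uint8_t byte)
{
    assert(high != 0);
    if (byte < 0x80)
        return byte;
    uint16_t u = high[byte - 0x80];
    return u ? u : UNMAPPED;
}

/* Unicode -> byte: linear scan of the same table; -1 if not representable. */
int unicode_to_sbcs(const uint16_t high[128], uint16_t u)
{
    assert(high != 0);
    if (u < 0x80)
        return (int)u;
    for (int i = 0; i < 128; i++)
        if (high[i] == u)
            return 0x80 + i;
    return -1;
}
```

A real codepage table would populate every slot and would normally use a reverse index instead of a linear scan; the sketch keeps one table so the two directions stay visibly consistent.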
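Before 1980 there was still no international standard such as ISO-8859 or Unicode that defined how to extend US-ASCII for users in non-English-speaking countries. Many IT vendors invented their own encodings and identified them by hard-to-remember numbers:
for example, 936 stands for Simplified Chinese and 950 for Traditional Chinese.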
1.1 CJK Codepages
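Unlike Extended Unix Code (EUC), all of the Far East codepages below use the C1 control codes {0x80..0x9F} as lead bytes and the ASCII values {0x40..0x7E} as trail bytes; only in this way can they hold tens of thousands of double-byte characters. As a consequence, a byte in the range 0x40..0x7E does not necessarily represent an ASCII character in these encodings, because it may be the second byte of a double-byte character.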
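The practical consequence is that a double-byte string cannot be searched byte by byte. The sketch below walks a buffer one character at a time so that a trail byte is never mistaken for ASCII. The helper names are hypothetical, and the byte ranges are an assumption: they are the ranges commonly documented for GBK (CP936), which are wider than the C1 range mentioned above, and other codepages would need their own ranges.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed GBK/CP936 lead-byte range: 0x81..0xFE. */
static int is_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Assumed GBK/CP936 trail-byte ranges: 0x40..0x7E and 0x80..0xFE. */
static int is_trail_byte(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFE);
}

/* Return the byte length (1 or 2) of the character starting at s[0],
 * or 0 if a lead byte is not followed by a valid trail byte. */
size_t dbcs_char_len(const unsigned char *s, size_t remaining)
{
    assert(s != NULL && remaining > 0);
    if (!is_lead_byte(s[0]))
        return 1;
    if (remaining < 2 || !is_trail_byte(s[1]))
        return 0;
    return 2;
}

/* Find an ASCII byte in a DBCS string without matching trail bytes.
 * Returns the byte offset, or -1 if absent or the input is malformed. */
long dbcs_find_ascii(const unsigned char *s, size_t len, unsigned char ch)
{
    assert(ch < 0x80);
    size_t i = 0;
    while (i < len) {
        size_t n = dbcs_char_len(s + i, len - i);
        if (n == 0)
            return -1;
        if (n == 1 && s[i] == ch)
            return (long)i;
        i += n;
    }
    return -1;
}
```

For the bytes 0x81 0x40 'a' '@', a naive byte search for '@' (0x40) would stop at offset 1, inside the double-byte character; `dbcs_find_ascii` skips that trail byte and reports offset 3.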
CP932
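Shift-JIS contains the Japanese charsets JIS X 0201 (one byte per character) and JIS X 0208 (two bytes per character). The JIS X 0201 katakana are therefore half-width, one-byte characters, and the remaining 60 byte values serve as lead bytes for 7076 kanji and 648 other full-width characters. Unlike EUC-JP, Shift-JIS does not include the 5802 kanji defined in JIS X 0212.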
CP936
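GBK extends the EUC-CN encoding (of GB 2312-80, which contains 6763 Chinese characters) to the 20902 Chinese characters defined in Unicode (GB 13000.1-93). Mainland China uses Simplified Chinese (zh_CN).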
CP949
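Unified Hangul Code (UHC) is a superset of the Korean EUC-KR encoding (of KS C 5601-1992, which comprises 2350 Hangul syllables and 4888 Hanja) and adds 8822 further Hangul syllables (placed in the C1 range).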
CP950
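The Big5 encoding (13072 Traditional Chinese characters, zh_TW), used for Traditional Chinese in place of EUC-TW (CNS 11643-1992). All of these definitions can be found in Ken Lunde's CJK.INF or in the Unicode mapping tables.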
Note: Microsoft uses the four codepages above, so accessing a Microsoft file system requires one of these codepages.
1.2 IBM Far East Codepages
IBM codepages come in two kinds, SBCS and DBCS:
IBM SBCS Codepage
IBM DBCS Codepage
Mixing the SBCS codepages with the DBCS codepages yields: IBM MBCS Codepage
* indicates that the EBCDIC encoding format is used
As this shows, Microsoft's CJK codepages derive from IBM's codepages.
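2. The Role of Codepages in Linux
Codepage support was introduced in Linux mainly to handle multilingual file names on file systems such as FAT/VFAT/FAT32/NTFS/NCPFS. NTFS and FAT32/VFAT store file names in Unicode, so the system has to convert them on the fly to the appropriate language encoding when reading them. NLS support was introduced for this purpose. The corresponding source files are under /usr/src/linux/fs/nls: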
- Config.in
- Makefile
- nls_base.c
- nls_cp437.c
- nls_cp737.c
- nls_cp775.c
- nls_cp850.c
- nls_cp852.c
- nls_cp855.c
- nls_cp857.c
- nls_cp860.c
- nls_cp861.c
- nls_cp862.c
- nls_cp863.c
- nls_cp864.c
- nls_cp865.c
- nls_cp866.c
- nls_cp869.c
- nls_cp874.c
- nls_cp936.c
- nls_cp950.c
- nls_iso8859-1.c
- nls_iso8859-15.c
- nls_iso8859-2.c
- nls_iso8859-3.c
- nls_iso8859-4.c
- nls_iso8859-5.c
- nls_iso8859-6.c
- nls_iso8859-7.c
- nls_iso8859-8.c
- nls_iso8859-9.c
- nls_koi8-r.c
They implement the following functions:
- extern int utf8_mbtowc(__u16 *, const __u8 *, int);
- extern int utf8_mbstowcs(__u16 *, const __u8 *, int);
- extern int utf8_wctomb(__u8 *, __u16, int);
- extern int utf8_wcstombs(__u8 *, const __u16 *, int);
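To make the single-character conversions concrete, here is a minimal user-space sketch of what a UTF-8 to 16-bit conversion pair can look like. It is not the kernel implementation: the names `sketch_utf8_mbtowc` and `sketch_utf8_wctomb` are hypothetical, the parameter order merely mirrors the prototypes above, and the return convention (bytes consumed or written, -1 on error) is an assumption of this sketch. Because the wide character is 16 bits, only code points up to U+FFFD-range BMP values fit; 4-byte sequences are rejected.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *out.
 * Returns the number of bytes consumed, or -1 on invalid/truncated input. */
int sketch_utf8_mbtowc(uint16_t *out, const uint8_t *s, int n)
{
    assert(out != 0 && s != 0);
    if (n < 1)
        return -1;
    uint8_t c0 = s[0];
    if (c0 < 0x80) {                       /* 0xxxxxxx: ASCII */
        *out = c0;
        return 1;
    }
    if ((c0 & 0xE0) == 0xC0) {             /* 110xxxxx 10xxxxxx */
        if (n < 2 || (s[1] & 0xC0) != 0x80)
            return -1;
        uint16_t wc = (uint16_t)(((c0 & 0x1F) << 6) | (s[1] & 0x3F));
        if (wc < 0x80)                     /* reject overlong forms */
            return -1;
        *out = wc;
        return 2;
    }
    if ((c0 & 0xF0) == 0xE0) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (n < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return -1;
        uint16_t wc = (uint16_t)(((c0 & 0x0F) << 12) |
                                 ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        if (wc < 0x800 || (wc >= 0xD800 && wc <= 0xDFFF))
            return -1;                     /* overlong or surrogate */
        *out = wc;
        return 3;
    }
    return -1;  /* continuation byte, or a sequence beyond 16 bits */
}

/* Encode wc as UTF-8 into s (maxlen bytes available).
 * Returns the number of bytes written, or -1 if it does not fit. */
int sketch_utf8_wctomb(uint8_t *s, uint16_t wc, int maxlen)
{
    assert(s != 0);
    if (wc < 0x80) {
        if (maxlen < 1)
            return -1;
        s[0] = (uint8_t)wc;
        return 1;
    }
    if (wc < 0x800) {
        if (maxlen < 2)
            return -1;
        s[0] = (uint8_t)(0xC0 | (wc >> 6));
        s[1] = (uint8_t)(0x80 | (wc & 0x3F));
        return 2;
    }
    if ((wc >= 0xD800 && wc <= 0xDFFF) || maxlen < 3)
        return -1;
    s[0] = (uint8_t)(0xE0 | (wc >> 12));
    s[1] = (uint8_t)(0x80 | ((wc >> 6) & 0x3F));
    s[2] = (uint8_t)(0x80 | (wc & 0x3F));
    return 3;
}
```

Encoding U+4E2D with this sketch produces the three bytes E4 B8 AD, and decoding those bytes gives U+4E2D back.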
With these in place, the codepage can be set through the following options when mounting the file system.
For codepage 437:
mount -t vfat /dev/hda1 /mnt/1 -o codepage=437,iocharset=cp437
Long file names in different languages can then be accessed normally under Linux.
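A program that mounts such a volume itself needs the same option string that follows `-o` in the command above. The helper below is a small sketch of building it; the name `format_vfat_options` is hypothetical, and the resulting string is the kind of comma-separated option list that would be passed as the `data` argument of mount(2). The actual mount call is left out because it requires root privileges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "codepage=<N>,iocharset=<name>" into buf.
 * Returns 0 on success, -1 if the name is unusable or the result
 * does not fit into size bytes (including the terminating NUL). */
int format_vfat_options(char *buf, size_t size,
                        unsigned codepage, const char *iocharset)
{
    assert(buf != NULL && iocharset != NULL);
    /* The option list is comma-separated, so a value must not be
     * empty and must not contain a comma itself. */
    if (iocharset[0] == '\0' || strchr(iocharset, ',') != NULL)
        return -1;
    int n = snprintf(buf, size, "codepage=%u,iocharset=%s",
                     codepage, iocharset);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Called with 437 and "cp437", the helper writes `codepage=437,iocharset=cp437`, matching the command line shown above.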
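3. Codepages Supported by Linux
Covers Central and Eastern European languages (Albanian, Croatian, Czech, English, Finnish, Hungarian, Irish, German, Polish, Romanian, Serbian, Slovak, Slovenian, Sorbian).
Western European languages (Albanian, Catalan (Spain), Danish, Dutch, English, Faroese, Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese, Swedish). This also applies to US English.
Character set for Slavic/Central European languages (Czech, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian).
An updated version of the codepage 1 character set that drops some rarely used characters, adds support for Estonian, corrects the French and Finnish coverage, and adds the euro sign.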
Microsoft Code Page Identifiers
The following table defines the available code page identifiers.
ANSI code pages can be different on different computers, or
can be changed for a single computer, leading to data corruption. For
the most consistent results, applications should use Unicode, such as
UTF-8 or UTF-16, instead of a specific code page.
Identifier | .NET Name | Additional information |
---|---|---|
037 | IBM037 | IBM EBCDIC US-Canada |
437 | IBM437 | OEM United States |
500 | IBM500 | IBM EBCDIC International |
708 | ASMO-708 | Arabic (ASMO 708) |
709 | | Arabic (ASMO-449+, BCON V4) |
710 | | Arabic - Transparent Arabic |
720 | DOS-720 | Arabic (Transparent ASMO); Arabic (DOS) |
737 | ibm737 | OEM Greek (formerly 437G); Greek (DOS) |
775 | ibm775 | OEM Baltic; Baltic (DOS) |
850 | ibm850 | OEM Multilingual Latin 1; Western European (DOS) |
852 | ibm852 | OEM Latin 2; Central European (DOS) |
855 | IBM855 | OEM Cyrillic (primarily Russian) |
857 | ibm857 | OEM Turkish; Turkish (DOS) |
858 | IBM00858 | OEM Multilingual Latin 1 + Euro symbol |
860 | IBM860 | OEM Portuguese; Portuguese (DOS) |
861 | ibm861 | OEM Icelandic; Icelandic (DOS) |
862 | DOS-862 | OEM Hebrew; Hebrew (DOS) |
863 | IBM863 | OEM French Canadian; French Canadian (DOS) |
864 | IBM864 | OEM Arabic; Arabic (864) |
865 | IBM865 | OEM Nordic; Nordic (DOS) |
866 | cp866 | OEM Russian; Cyrillic (DOS) |
869 | ibm869 | OEM Modern Greek; Greek, Modern (DOS) |
870 | IBM870 | IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2 |
874 | windows-874 | ANSI/OEM Thai (ISO 8859-11); Thai (Windows) |
875 | cp875 | IBM EBCDIC Greek Modern |
932 | shift_jis | ANSI/OEM Japanese; Japanese (Shift-JIS) |
936 | gb2312 | ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312) |
949 | ks_c_5601-1987 | ANSI/OEM Korean (Unified Hangul Code) |
950 | big5 | ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5) |
1026 | IBM1026 | IBM EBCDIC Turkish (Latin 5) |
1047 | IBM01047 | IBM EBCDIC Latin 1/Open System |
1140 | IBM01140 | IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro) |
1141 | IBM01141 | IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro) |
1142 | IBM01142 | IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro) |
1143 | IBM01143 | IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro) |
1144 | IBM01144 | IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro) |
1145 | IBM01145 | IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro) |
1146 | IBM01146 | IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro) |
1147 | IBM01147 | IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro) |
1148 | IBM01148 | IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro) |
1149 | IBM01149 | IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro) |
1200 | utf-16 | Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications |
1201 | unicodeFFFE | Unicode UTF-16, big endian byte order; available only to managed applications |
1250 | windows-1250 | ANSI Central European; Central European (Windows) |
1251 | windows-1251 | ANSI Cyrillic; Cyrillic (Windows) |
1252 | windows-1252 | ANSI Latin 1; Western European (Windows) |
1253 | windows-1253 | ANSI Greek; Greek (Windows) |
1254 | windows-1254 | ANSI Turkish; Turkish (Windows) |
1255 | windows-1255 | ANSI Hebrew; Hebrew (Windows) |
1256 | windows-1256 | ANSI Arabic; Arabic (Windows) |
1257 | windows-1257 | ANSI Baltic; Baltic (Windows) |
1258 | windows-1258 | ANSI/OEM Vietnamese; Vietnamese (Windows) |
1361 | Johab | Korean (Johab) |
10000 | macintosh | MAC Roman; Western European (Mac) |
10001 | x-mac-japanese | Japanese (Mac) |
10002 | x-mac-chinesetrad | MAC Traditional Chinese (Big5); Chinese Traditional (Mac) |
10003 | x-mac-korean | Korean (Mac) |
10004 | x-mac-arabic | Arabic (Mac) |
10005 | x-mac-hebrew | Hebrew (Mac) |
10006 | x-mac-greek | Greek (Mac) |
10007 | x-mac-cyrillic | Cyrillic (Mac) |
10008 | x-mac-chinesesimp | MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac) |
10010 | x-mac-romanian | Romanian (Mac) |
10017 | x-mac-ukrainian | Ukrainian (Mac) |
10021 | x-mac-thai | Thai (Mac) |
10029 | x-mac-ce | MAC Latin 2; Central European (Mac) |
10079 | x-mac-icelandic | Icelandic (Mac) |
10081 | x-mac-turkish | Turkish (Mac) |
10082 | x-mac-croatian | Croatian (Mac) |
12000 | utf-32 | Unicode UTF-32, little endian byte order; available only to managed applications |
12001 | utf-32BE | Unicode UTF-32, big endian byte order; available only to managed applications |
20000 | x-Chinese_CNS | CNS Taiwan; Chinese Traditional (CNS) |
20001 | x-cp20001 | TCA Taiwan |
20002 | x_Chinese-Eten | Eten Taiwan; Chinese Traditional (Eten) |
20003 | x-cp20003 | IBM5550 Taiwan |
20004 | x-cp20004 | TeleText Taiwan |
20005 | x-cp20005 | Wang Taiwan |
20105 | x-IA5 | IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5) |
20106 | x-IA5-German | IA5 German (7-bit) |
20107 | x-IA5-Swedish | IA5 Swedish (7-bit) |
20108 | x-IA5-Norwegian | IA5 Norwegian (7-bit) |
20127 | us-ascii | US-ASCII (7-bit) |
20261 | x-cp20261 | T.61 |
20269 | x-cp20269 | ISO 6937 Non-Spacing Accent |
20273 | IBM273 | IBM EBCDIC Germany |
20277 | IBM277 | IBM EBCDIC Denmark-Norway |
20278 | IBM278 | IBM EBCDIC Finland-Sweden |
20280 | IBM280 | IBM EBCDIC Italy |
20284 | IBM284 | IBM EBCDIC Latin America-Spain |
20285 | IBM285 | IBM EBCDIC United Kingdom |
20290 | IBM290 | IBM EBCDIC Japanese Katakana Extended |
20297 | IBM297 | IBM EBCDIC France |
20420 | IBM420 | IBM EBCDIC Arabic |
20423 | IBM423 | IBM EBCDIC Greek |
20424 | IBM424 | IBM EBCDIC Hebrew |
20833 | x-EBCDIC-KoreanExtended | IBM EBCDIC Korean Extended |
20838 | IBM-Thai | IBM EBCDIC Thai |
20866 | koi8-r | Russian (KOI8-R); Cyrillic (KOI8-R) |
20871 | IBM871 | IBM EBCDIC Icelandic |
20880 | IBM880 | IBM EBCDIC Cyrillic Russian |
20905 | IBM905 | IBM EBCDIC Turkish |
20924 | IBM00924 | IBM EBCDIC Latin 1/Open System (1047 + Euro symbol) |
20932 | EUC-JP | Japanese (JIS 0208-1990 and 0212-1990) |
20936 | x-cp20936 | Simplified Chinese (GB2312); Chinese Simplified (GB2312-80) |
20949 | x-cp20949 | Korean Wansung |
21025 | cp1025 | IBM EBCDIC Cyrillic Serbian-Bulgarian |
21027 | | (deprecated) |
21866 | koi8-u | Ukrainian (KOI8-U); Cyrillic (KOI8-U) |
28591 | iso-8859-1 | ISO 8859-1 Latin 1; Western European (ISO) |
28592 | iso-8859-2 | ISO 8859-2 Central European; Central European (ISO) |
28593 | iso-8859-3 | ISO 8859-3 Latin 3 |
28594 | iso-8859-4 | ISO 8859-4 Baltic |
28595 | iso-8859-5 | ISO 8859-5 Cyrillic |
28596 | iso-8859-6 | ISO 8859-6 Arabic |
28597 | iso-8859-7 | ISO 8859-7 Greek |
28598 | iso-8859-8 | ISO 8859-8 Hebrew; Hebrew (ISO-Visual) |
28599 | iso-8859-9 | ISO 8859-9 Turkish |
28603 | iso-8859-13 | ISO 8859-13 Estonian |
28605 | iso-8859-15 | ISO 8859-15 Latin 9 |
29001 | x-Europa | Europa 3 |
38598 | iso-8859-8-i | ISO 8859-8 Hebrew; Hebrew (ISO-Logical) |
50220 | iso-2022-jp | ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS) |
50221 | csISO2022JP | ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana) |
50222 | iso-2022-jp | ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI) |
50225 | iso-2022-kr | ISO 2022 Korean |
50227 | x-cp50227 | ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022) |
50229 | | ISO 2022 Traditional Chinese |
50930 | | EBCDIC Japanese (Katakana) Extended |
50931 | | EBCDIC US-Canada and Japanese |
50933 | | EBCDIC Korean Extended and Korean |
50935 | | EBCDIC Simplified Chinese Extended and Simplified Chinese |
50936 | | EBCDIC Simplified Chinese |
50937 | | EBCDIC US-Canada and Traditional Chinese |
50939 | | EBCDIC Japanese (Latin) Extended and Japanese |
51932 | euc-jp | EUC Japanese |
51936 | EUC-CN | EUC Simplified Chinese; Chinese Simplified (EUC) |
51949 | euc-kr | EUC Korean |
51950 | | EUC Traditional Chinese |
52936 | hz-gb-2312 | HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ) |
54936 | GB18030 | Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030) |
57002 | x-iscii-de | ISCII Devanagari |
57003 | x-iscii-be | ISCII Bangla |
57004 | x-iscii-ta | ISCII Tamil |
57005 | x-iscii-te | ISCII Telugu |
57006 | x-iscii-as | ISCII Assamese |
57007 | x-iscii-or | ISCII Odia |
57008 | x-iscii-ka | ISCII Kannada |
57009 | x-iscii-ma | ISCII Malayalam |
57010 | x-iscii-gu | ISCII Gujarati |
57011 | x-iscii-pa | ISCII Punjabi |
65000 | utf-7 | Unicode (UTF-7) |
65001 | utf-8 | Unicode (UTF-8) |
Code Pages
enum CodePages
{
IBM037=37,
IBM437=437,
IBM500=500,
ASMO_708=708,
DOS_720=720,
ibm737=737,
ibm775=775,
ibm850=850,
ibm852=852,
IBM855=855,
ibm857=857,
IBM00858=858,
IBM860=860,
ibm861=861,
DOS_862=862,
IBM863=863,
IBM864=864,
IBM865=865,
cp866=866,
ibm869=869,
IBM870=870,
windows_874=874,
cp875=875,
shift_jis=932,
gb2312=936,
ks_c_5601_1987=949,
big5=950,
IBM1026=1026,
IBM01047=1047,
IBM01140=1140,
IBM01141=1141,
IBM01142=1142,
IBM01143=1143,
IBM01144=1144,
IBM01145=1145,
IBM01146=1146,
IBM01147=1147,
IBM01148=1148,
IBM01149=1149,
utf_16=1200,
unicodeFFFE=1201,
windows_1250=1250,
windows_1251=1251,
windows_1252=1252,
windows_1253=1253,
windows_1254=1254,
windows_1255=1255,
windows_1256=1256,
windows_1257=1257,
windows_1258=1258,
Johab=1361,
macintosh=10000,
x_mac_japanese=10001,
x_mac_chinesetrad=10002,
x_mac_korean=10003,
x_mac_arabic=10004,
x_mac_hebrew=10005,
x_mac_greek=10006,
x_mac_cyrillic=10007,
x_mac_chinesesimp=10008,
x_mac_romanian=10010,
x_mac_ukrainian=10017,
x_mac_thai=10021,
x_mac_ce=10029,
x_mac_icelandic=10079,
x_mac_turkish=10081,
x_mac_croatian=10082,
utf_32=12000,
utf_32BE=12001,
x_Chinese_CNS=20000,
x_cp20001=20001,
x_Chinese_Eten=20002,
x_cp20003=20003,
x_cp20004=20004,
x_cp20005=20005,
x_IA5=20105,
x_IA5_German=20106,
x_IA5_Swedish=20107,
x_IA5_Norwegian=20108,
us_ascii=20127,
x_cp20261=20261,
x_cp20269=20269,
IBM273=20273,
IBM277=20277,
IBM278=20278,
IBM280=20280,
IBM284=20284,
IBM285=20285,
IBM290=20290,
IBM297=20297,
IBM420=20420,
IBM423=20423,
IBM424=20424,
x_EBCDIC_KoreanExtended=20833,
IBM_Thai=20838,
koi8_r=20866,
IBM871=20871,
IBM880=20880,
IBM905=20905,
IBM00924=20924,
EUC_JP=20932,
x_cp20936=20936,
x_cp20949=20949,
cp1025=21025,
koi8_u=21866,
iso_8859_1=28591,
iso_8859_2=28592,
iso_8859_3=28593,
iso_8859_4=28594,
iso_8859_5=28595,
iso_8859_6=28596,
iso_8859_7=28597,
iso_8859_8=28598,
iso_8859_9=28599,
iso_8859_13=28603,
iso_8859_15=28605,
x_Europa=29001,
iso_8859_8_i=38598,
iso_2022_jp=50220,
csISO2022JP=50221,
iso_2022_kr=50225,
x_cp50227=50227,
euc_jp=51932,
EUC_CN=51936,
euc_kr=51949,
hz_gb_2312=52936,
GB18030=54936,
x_iscii_de=57002,
x_iscii_be=57003,
x_iscii_ta=57004,
x_iscii_te=57005,
x_iscii_as=57006,
x_iscii_or=57007,
x_iscii_ka=57008,
x_iscii_ma=57009,
x_iscii_gu=57010,
x_iscii_pa=57011,
utf_7=65000,
utf_8=65001
};
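An enum gives the identifiers names at compile time, but code that reads an identifier from a file or a configuration string often needs the mapping at run time as well. The following sketch shows one way to do that with a sorted table and `bsearch()`. The struct, the function names `codepage_name` and `codepage_id`, and the choice of eight rows are all illustrative assumptions; the rows themselves are copied from the identifier table above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct cp_entry {
    unsigned id;
    const char *name;
};

/* A small excerpt of the identifier table, sorted by id for bsearch(). */
static const struct cp_entry cp_table[] = {
    {   437, "IBM437"         },
    {   932, "shift_jis"      },
    {   936, "gb2312"         },
    {   949, "ks_c_5601-1987" },
    {   950, "big5"           },
    {  1200, "utf-16"         },
    {  1252, "windows-1252"   },
    { 65001, "utf-8"          },
};
#define CP_COUNT (sizeof cp_table / sizeof cp_table[0])

/* bsearch() comparator: key is an unsigned id, elem is a table row. */
static int cmp_id(const void *key, const void *elem)
{
    unsigned k = *(const unsigned *)key;
    unsigned e = ((const struct cp_entry *)elem)->id;
    return (k > e) - (k < e);
}

/* Identifier -> name, or NULL if the id is not in the excerpt. */
const char *codepage_name(unsigned id)
{
    const struct cp_entry *hit =
        bsearch(&id, cp_table, CP_COUNT, sizeof cp_table[0], cmp_id);
    return hit ? hit->name : NULL;
}

/* Name -> identifier (exact, case-sensitive match), or -1 if unknown. */
long codepage_id(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < CP_COUNT; i++)
        if (strcmp(cp_table[i].name, name) == 0)
            return (long)cp_table[i].id;
    return -1;
}
```

Keeping the rows sorted by identifier is what makes the binary search valid; a full table would need the same ordering, and names that appear for more than one identifier (such as iso-2022-jp) would make the reverse lookup ambiguous.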