CodePage

Links:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
https://en.wikipedia.org/wiki/Code_page

1. Definition and History of Codepages

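A character code is the internal code used to represent a character. Documents are entered and stored using these codes, which fall into two classes:

  • Single-byte codes -- Single-Byte Character Sets (SBCS), which can encode up to 256 characters.
  • Double-byte codes -- Double-Byte Character Sets (DBCS), which can encode roughly 65,000 characters and are mainly used for East Asian scripts with large character sets.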

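A codepage is a selected list of character codes arranged in a specific order. For the languages served by the early single-byte codes, the order of the codes in the codepage lets the system map a keyboard input value to the corresponding character code. For double-byte codes, the codepage instead provides a mapping table between the multibyte encoding and Unicode, so that characters stored as Unicode can be converted to the corresponding character codes and back. In the Linux kernel, the corresponding functions are utf8_mbtowc and utf8_wctomb.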
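The idea of a codepage as a lookup table can be sketched in Python, whose standard codecs ship tables for many of the codepages discussed here. The byte value 0xE9 and the helper name decode_with are chosen only for illustration; the point is that the same byte maps to a different Unicode character under each table.

```python
import unicodedata

# One byte value, interpreted under several codepage tables.
raw = b"\xe9"

def decode_with(codepage: str, data: bytes) -> str:
    """Map bytes to Unicode text using the named codepage's table."""
    return data.decode(codepage)

for name in ("cp437", "cp850", "cp866", "cp1252"):
    ch = decode_with(name, raw)
    # Print only ASCII so the output does not depend on the terminal encoding.
    print(f"{name}: 0x{raw[0]:02X} -> U+{ord(ch):04X} {unicodedata.name(ch)}")

# The reverse direction (Unicode -> codepage bytes) uses the same table.
assert "\u00e9".encode("cp1252") == raw
```

Storing text as Unicode and converting at the boundary, as the paragraph above describes, avoids having to guess which of these tables a given byte was written under.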

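Until the 1980s there was no international standard, such as ISO-8859 or Unicode, that defined how to extend US-ASCII for users in non-English-speaking countries. Many IT vendors invented their own encodings and identified them by hard-to-remember numbers:

For example, 936 stands for Simplified Chinese and 950 for Traditional Chinese.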

1.1 CJK Codepage

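Unlike the Extended Unix Coding (EUC) encodings, all of the Far East codepages below use the C1 control codes { =80..=9F } as lead bytes and the ASCII values { =40..=7E } as trail bytes. This is what lets them hold tens of thousands of double-byte characters, and it means that in these encodings a byte value above 3F does not necessarily represent an ASCII character.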
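A small Python sketch makes the consequence concrete. The sample character (U+8868) is picked only because its CP932 trail byte happens to be 0x5C, the ASCII backslash; the comparison with EUC-JP uses Python's built-in euc_jp codec.

```python
text = "\u8868"  # a single kanji, chosen only as an illustration

sjis = text.encode("cp932")    # Shift-JIS as used by Microsoft
euc = text.encode("euc_jp")    # EUC-JP, for comparison

print(sjis.hex(), euc.hex())

# In CP932 the lead byte sits in the C1 range and the trail byte
# falls inside the ASCII range (0x5C is the backslash).
lead, trail = sjis
assert 0x80 <= lead <= 0x9F
assert 0x40 <= trail <= 0x7E

# In EUC-JP every byte of the character has the high bit set.
assert all(b >= 0xA1 for b in euc)

# A naive byte-wise search for a backslash therefore finds a false match.
naive_hit = b"\\" in sjis
print("backslash found in CP932 bytes:", naive_hit)
```

Code that scans such text byte by byte (for path separators, quotes, and so on) has to track lead bytes first, otherwise it can split a character in half.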

CP932

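Shift-JIS covers the Japanese character sets JIS X 0201 (one byte per character) and JIS X 0208 (two bytes per character). The JIS X 0201 katakana are therefore single-byte half-width characters, and the remaining 60 byte values serve as lead bytes for 7076 kanji and 648 other full-width characters. Unlike EUC-JP, Shift-JIS does not include the roughly 5,800 additional kanji defined in JIS X 0212.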

CP936

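GBK extends the EUC-CN encoding (GB 2312-80, which contains 6763 Chinese characters) to the 20902 Chinese characters defined in Unicode (GB13000.1-93). It is used for Simplified Chinese (zh_CN) in mainland China.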

CP949

Unified Hangul Code (UHC) is a superset of the Korean EUC-KR encoding (KS C 5601-1992, which contains 2350 Hangul syllables and 4888 Hanja). It adds 8822 further Hangul syllables (in the C1 range).

CP950

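Big5 (13072 Traditional Chinese characters, zh_TW) is the encoding used for Traditional Chinese in place of EUC-TW (CNS 11643-1992). All of these definitions can be found in Ken Lunde's CJK.INF or in the Unicode mapping tables.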

Note: Microsoft uses the four codepages above, so they must be used when accessing Microsoft filesystems.
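As a rough illustration, Python's standard library includes codecs named cp932, cp936, cp949 and cp950. The sample strings below are arbitrary; the sketch only shows that each codepage round-trips its own text and that the same characters get different bytes under different codepages.

```python
# Arbitrary sample text for each of the four Microsoft CJK codepages.
samples = {
    "cp932": "\u65e5\u672c\u8a9e",   # Japanese
    "cp936": "\u4e2d\u6587",         # Simplified Chinese
    "cp949": "\ud55c\uae00",         # Korean
    "cp950": "\u4e2d\u6587",         # Traditional Chinese
}

encoded = {}
for codepage, text in samples.items():
    data = text.encode(codepage)
    encoded[codepage] = data
    # Each codepage decodes its own bytes back to the original text.
    assert data.decode(codepage) == text
    print(f"{codepage}: {len(text)} characters -> {data.hex()}")

# The same two characters have different byte sequences in GBK and Big5,
# so bytes must always be decoded with the codepage they were written in.
assert encoded["cp936"] != encoded["cp950"]
```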

1.2 IBM Far East Codepages

IBM codepages come in two kinds, SBCS and DBCS:

IBM SBCS Codepage

  • 37 (English) *
  • 290 (Japanese) *
  • 833 (Korean) *
  • 836 (Simplified Chinese) *
  • 891 (Korean)
  • 897 (Japanese)
  • 903 (Simplified Chinese)
  • 904 (Traditional Chinese)

IBM DBCS Codepage

  • 300 (Japanese) *
  • 301 (Japanese)
  • 834 (Korean) *
  • 835 (Traditional Chinese) *
  • 837 (Simplified Chinese) *
  • 926 (Korean)
  • 927 (Traditional Chinese)
  • 928 (Simplified Chinese)

Combining an SBCS codepage with a DBCS codepage gives an IBM MBCS Codepage:

  • 930 (Japanese) (Codepage 300 plus 290) *
  • 932 (Japanese) (Codepage 301 plus 897)
  • 933 (Korean) (Codepage 834 plus 833) *
  • 934 (Korean) (Codepage 926 plus 891)
  • 938 (Traditional Chinese) (Codepage 927 plus 904)
  • 936 (Simplified Chinese) (Codepage 928 plus 903)
  • 5031 (Simplified Chinese) (Codepage 837 plus 836) *
  • 5033 (Traditional Chinese) (Codepage 835 plus 37) *

* indicates an EBCDIC-based encoding.

As this shows, Microsoft's CJK codepages originate from IBM's codepages.
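To see what "EBCDIC-based" means in practice, here is a minimal Python sketch using the cp037 codec from the standard library, which corresponds to codepage 37 in the list above. The sample string is arbitrary.

```python
text = "CODEPAGE 37"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")   # IBM codepage 37 (EBCDIC)

print("ASCII :", ascii_bytes.hex())
print("EBCDIC:", ebcdic_bytes.hex())

# Same text, same length, but entirely different byte values:
# in EBCDIC, 'A' is 0xC1, the digits start at 0xF0, and space is 0x40.
assert len(ascii_bytes) == len(ebcdic_bytes)
assert ebcdic_bytes != ascii_bytes

# Decoding with the matching codepage restores the text.
assert ebcdic_bytes.decode("cp037") == text
```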

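2. The Role of Codepages in Linux

Codepage support was introduced into Linux mainly to handle multilingual file names on filesystems such as FAT/VFAT/FAT32/NTFS/NCPFS. NTFS and FAT32/VFAT currently store file names in Unicode, so the system has to convert them on the fly to the appropriate language encoding when reading them. NLS support was introduced for this purpose. The corresponding source files are under /usr/src/linux/fs/nls: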

    • Config.in
    • Makefile
    • nls_base.c
    • nls_cp437.c
    • nls_cp737.c
    • nls_cp775.c
    • nls_cp850.c
    • nls_cp852.c
    • nls_cp855.c
    • nls_cp857.c
    • nls_cp860.c
    • nls_cp861.c
    • nls_cp862.c
    • nls_cp863.c
    • nls_cp864.c
    • nls_cp865.c
    • nls_cp866.c
    • nls_cp869.c
    • nls_cp874.c
    • nls_cp936.c
    • nls_cp950.c
    • nls_iso8859-1.c
    • nls_iso8859-15.c
    • nls_iso8859-2.c
    • nls_iso8859-3.c
    • nls_iso8859-4.c
    • nls_iso8859-5.c
    • nls_iso8859-6.c
    • nls_iso8859-7.c
    • nls_iso8859-8.c
    • nls_iso8859-9.c
    • nls_koi8-r.c

The following functions are implemented:

    • extern int utf8_mbtowc(__u16 *, const __u8 *, int);
    • extern int utf8_mbstowcs(__u16 *, const __u8 *, int);
    • extern int utf8_wctomb(__u8 *, __u16, int);
    • extern int utf8_wcstombs(__u8 *, const __u16 *, int);
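The conversion these functions perform can be sketched in Python. This is not the kernel code: the Python functions below only borrow the names utf8_wctomb and utf8_mbtowc, their signatures are simplified assumptions, and they handle only 16-bit values (matching the __u16 type above) without checking for surrogates or overlong forms.

```python
from typing import Tuple

def utf8_wctomb(wc: int) -> bytes:
    """Encode one 16-bit value (0..0xFFFF) as 1 to 3 UTF-8 bytes."""
    if not 0 <= wc <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    if wc < 0x80:
        return bytes([wc])
    if wc < 0x800:
        return bytes([0xC0 | (wc >> 6), 0x80 | (wc & 0x3F)])
    return bytes([0xE0 | (wc >> 12),
                  0x80 | ((wc >> 6) & 0x3F),
                  0x80 | (wc & 0x3F)])

def utf8_mbtowc(data: bytes) -> Tuple[int, int]:
    """Decode the first character of data; return (value, bytes consumed)."""
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    if b0 < 0x80:
        return b0, 1
    if 0xC0 <= b0 < 0xE0:
        size, wc = 2, b0 & 0x1F
    elif 0xE0 <= b0 < 0xF0:
        size, wc = 3, b0 & 0x0F
    else:
        raise ValueError("not the start of a 1- to 3-byte sequence")
    if len(data) < size:
        raise ValueError("truncated sequence")
    for b in data[1:size]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        wc = (wc << 6) | (b & 0x3F)
    return wc, size

# Cross-check against Python's own UTF-8 codec for 1-, 2- and 3-byte cases.
for ch in ("A", "\u00e9", "\u4e2d"):
    assert utf8_wctomb(ord(ch)) == ch.encode("utf-8")
    assert utf8_mbtowc(ch.encode("utf-8")) == (ord(ch), len(ch.encode("utf-8")))
```

The string variants (utf8_mbstowcs, utf8_wcstombs) would amount to calling these per-character functions in a loop while advancing by the number of bytes consumed.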

With this in place, the codepage can be set with the following options when mounting the filesystem.

For codepage 437:

    mount -t vfat /dev/hda1 /mnt/1 -o codepage=437,iocharset=cp437

Long file names in different languages can then be accessed normally under Linux.
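Why the choice of character set matters can be illustrated with a rough Python model. It assumes a long file name stored as UTF-16 and converted to the chosen character set for display; the file name and the helper present_name are hypothetical, and the real driver's handling of unmappable characters may differ from the "?" substitution used here.

```python
# Hypothetical long file name as stored on disk (modelled here as UTF-16LE).
stored = "\u4e2d\u6587.txt".encode("utf-16-le")

def present_name(stored_utf16: bytes, iocharset: str) -> bytes:
    """Convert a stored Unicode name to the bytes a user would see."""
    name = stored_utf16.decode("utf-16-le")
    # Characters missing from the target table are replaced with '?'.
    return name.encode(iocharset, errors="replace")

matching = present_name(stored, "cp936")   # table contains the characters
mismatch = present_name(stored, "cp437")   # table does not

print(matching)
print(mismatch)

# With the matching character set the name survives a round trip.
assert matching.decode("cp936") == stored.decode("utf-16-le")
```

With a character set that lacks the needed characters, different file names can collapse to the same string of question marks, which is why the mount options should match the language of the names on the volume.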

3. Codepages Supported by Linux

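  • nls codepage 437 -- US/Canadian English
  • nls codepage 737 -- Greek
  • nls codepage 775 -- Baltic
  • nls codepage 850 -- includes characters for some Western European languages (German, Spanish, Italian)
  • nls codepage 852 -- Latin 2, covering Central and Eastern European languages (Albanian, Croatian, Czech, English, Finnish, Hungarian, Irish, German, Polish, Romanian, Serbian, Slovak, Slovenian, Sorbian)
  • nls codepage 855 -- Cyrillic
  • nls codepage 857 -- Turkish
  • nls codepage 860 -- Portuguese
  • nls codepage 861 -- Icelandic
  • nls codepage 862 -- Hebrew
  • nls codepage 863 -- Canadian French
  • nls codepage 864 -- Arabic
  • nls codepage 865 -- Nordic languages (Danish, Norwegian)
  • nls codepage 866 -- Cyrillic/Russian
  • nls codepage 869 -- Greek (2)
  • nls codepage 874 -- Thai
  • nls codepage 936 -- Simplified Chinese GBK
  • nls codepage 950 -- Traditional Chinese Big5
  • nls iso8859-1 -- Western European languages (Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese, Swedish). It also applies to US English.
  • nls iso8859-2 -- Latin 2 character set, Slavic/Central European languages (Czech, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian)
  • nls iso8859-3 -- Latin 3 character set (Esperanto, Galician, Maltese, Turkish)
  • nls iso8859-4 -- Latin 4 character set (Estonian, Latvian, Lithuanian); the predecessor of the Latin 6 character set
  • nls iso8859-5 -- Cyrillic (Bulgarian, Belarusian, Macedonian, Russian, Serbian, Ukrainian); the KOI8-R codepage is generally recommended instead
  • nls iso8859-6 -- Arabic
  • nls iso8859-7 -- Modern Greek
  • nls iso8859-8 -- Hebrew
  • nls iso8859-9 -- Latin 5 character set (replaces some rarely used Icelandic characters of Latin 1 with Turkish characters)
  • nls iso8859-10 -- Latin 6 character set (adds Inuit (Greenlandic), Sami and other Nordic languages not covered by Latin 4)
  • nls iso8859-15 -- Latin 9 character set, an update of Latin 1 (drops some rarely used characters, adds support for Estonian, corrects the French and Finnish coverage, and adds the euro sign)
  • nls koi8-r -- default support for Russian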

    Microsoft Code Page Identifiers

    The following table defines the available code page identifiers.

    Note  
    ANSI code pages can be different on different computers, or
    can be changed for a single computer, leading to data corruption. For
    the most consistent results, applications should use Unicode, such as
    UTF-8 or UTF-16, instead of a specific code page.
     

    Identifier .NET Name Additional information
    037 IBM037 IBM EBCDIC US-Canada
    437 IBM437 OEM United States
    500 IBM500 IBM EBCDIC International
    708 ASMO-708 Arabic (ASMO 708)
    709 Arabic (ASMO-449+, BCON V4)
    710 Arabic - Transparent Arabic
    720 DOS-720 Arabic (Transparent ASMO); Arabic (DOS)
    737 ibm737 OEM Greek (formerly 437G); Greek (DOS)
    775 ibm775 OEM Baltic; Baltic (DOS)
    850 ibm850 OEM Multilingual Latin 1; Western European (DOS)
    852 ibm852 OEM Latin 2; Central European (DOS)
    855 IBM855 OEM Cyrillic (primarily Russian)
    857 ibm857 OEM Turkish; Turkish (DOS)
    858 IBM00858 OEM Multilingual Latin 1 + Euro symbol
    860 IBM860 OEM Portuguese; Portuguese (DOS)
    861 ibm861 OEM Icelandic; Icelandic (DOS)
    862 DOS-862 OEM Hebrew; Hebrew (DOS)
    863 IBM863 OEM French Canadian; French Canadian (DOS)
    864 IBM864 OEM Arabic; Arabic (864)
    865 IBM865 OEM Nordic; Nordic (DOS)
    866 cp866 OEM Russian; Cyrillic (DOS)
    869 ibm869 OEM Modern Greek; Greek, Modern (DOS)
    870 IBM870 IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
    874 windows-874 ANSI/OEM Thai (ISO 8859-11); Thai (Windows)
    875 cp875 IBM EBCDIC Greek Modern
    932 shift_jis ANSI/OEM Japanese; Japanese (Shift-JIS)
    936 gb2312 ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)
    949 ks_c_5601-1987 ANSI/OEM Korean (Unified Hangul Code)
    950 big5 ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)
    1026 IBM1026 IBM EBCDIC Turkish (Latin 5)
    1047 IBM01047 IBM EBCDIC Latin 1/Open System
    1140 IBM01140 IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
    1141 IBM01141 IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
    1142 IBM01142 IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
    1143 IBM01143 IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
    1144 IBM01144 IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
    1145 IBM01145 IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
    1146 IBM01146 IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
    1147 IBM01147 IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
    1148 IBM01148 IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
    1149 IBM01149 IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
    1200 utf-16 Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
    1201 unicodeFFFE Unicode UTF-16, big endian byte order; available only to managed applications
    1250 windows-1250 ANSI Central European; Central European (Windows)
    1251 windows-1251 ANSI Cyrillic; Cyrillic (Windows)
    1252 windows-1252 ANSI Latin 1; Western European (Windows)
    1253 windows-1253 ANSI Greek; Greek (Windows)
    1254 windows-1254 ANSI Turkish; Turkish (Windows)
    1255 windows-1255 ANSI Hebrew; Hebrew (Windows)
    1256 windows-1256 ANSI Arabic; Arabic (Windows)
    1257 windows-1257 ANSI Baltic; Baltic (Windows)
    1258 windows-1258 ANSI/OEM Vietnamese; Vietnamese (Windows)
    1361 Johab Korean (Johab)
    10000 macintosh MAC Roman; Western European (Mac)
    10001 x-mac-japanese Japanese (Mac)
    10002 x-mac-chinesetrad MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
    10003 x-mac-korean Korean (Mac)
    10004 x-mac-arabic Arabic (Mac)
    10005 x-mac-hebrew Hebrew (Mac)
    10006 x-mac-greek Greek (Mac)
    10007 x-mac-cyrillic Cyrillic (Mac)
    10008 x-mac-chinesesimp MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
    10010 x-mac-romanian Romanian (Mac)
    10017 x-mac-ukrainian Ukrainian (Mac)
    10021 x-mac-thai Thai (Mac)
    10029 x-mac-ce MAC Latin 2; Central European (Mac)
    10079 x-mac-icelandic Icelandic (Mac)
    10081 x-mac-turkish Turkish (Mac)
    10082 x-mac-croatian Croatian (Mac)
    12000 utf-32 Unicode UTF-32, little endian byte order; available only to managed applications
    12001 utf-32BE Unicode UTF-32, big endian byte order; available only to managed applications
    20000 x-Chinese_CNS CNS Taiwan; Chinese Traditional (CNS)
    20001 x-cp20001 TCA Taiwan
    20002 x_Chinese-Eten Eten Taiwan; Chinese Traditional (Eten)
    20003 x-cp20003 IBM5550 Taiwan
    20004 x-cp20004 TeleText Taiwan
    20005 x-cp20005 Wang Taiwan
    20105 x-IA5 IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
    20106 x-IA5-German IA5 German (7-bit)
    20107 x-IA5-Swedish IA5 Swedish (7-bit)
    20108 x-IA5-Norwegian IA5 Norwegian (7-bit)
    20127 us-ascii US-ASCII (7-bit)
    20261 x-cp20261 T.61
    20269 x-cp20269 ISO 6937 Non-Spacing Accent
    20273 IBM273 IBM EBCDIC Germany
    20277 IBM277 IBM EBCDIC Denmark-Norway
    20278 IBM278 IBM EBCDIC Finland-Sweden
    20280 IBM280 IBM EBCDIC Italy
    20284 IBM284 IBM EBCDIC Latin America-Spain
    20285 IBM285 IBM EBCDIC United Kingdom
    20290 IBM290 IBM EBCDIC Japanese Katakana Extended
    20297 IBM297 IBM EBCDIC France
    20420 IBM420 IBM EBCDIC Arabic
    20423 IBM423 IBM EBCDIC Greek
    20424 IBM424 IBM EBCDIC Hebrew
    20833 x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended
    20838 IBM-Thai IBM EBCDIC Thai
    20866 koi8-r Russian (KOI8-R); Cyrillic (KOI8-R)
    20871 IBM871 IBM EBCDIC Icelandic
    20880 IBM880 IBM EBCDIC Cyrillic Russian
    20905 IBM905 IBM EBCDIC Turkish
    20924 IBM00924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
    20932 EUC-JP Japanese (JIS 0208-1990 and 0212-1990)
    20936 x-cp20936 Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
    20949 x-cp20949 Korean Wansung
    21025 cp1025 IBM EBCDIC Cyrillic Serbian-Bulgarian
    21027 (deprecated)
    21866 koi8-u Ukrainian (KOI8-U); Cyrillic (KOI8-U)
    28591 iso-8859-1 ISO 8859-1 Latin 1; Western European (ISO)
    28592 iso-8859-2 ISO 8859-2 Central European; Central European (ISO)
    28593 iso-8859-3 ISO 8859-3 Latin 3
    28594 iso-8859-4 ISO 8859-4 Baltic
    28595 iso-8859-5 ISO 8859-5 Cyrillic
    28596 iso-8859-6 ISO 8859-6 Arabic
    28597 iso-8859-7 ISO 8859-7 Greek
    28598 iso-8859-8 ISO 8859-8 Hebrew; Hebrew (ISO-Visual)
    28599 iso-8859-9 ISO 8859-9 Turkish
    28603 iso-8859-13 ISO 8859-13 Estonian
    28605 iso-8859-15 ISO 8859-15 Latin 9
    29001 x-Europa Europa 3
    38598 iso-8859-8-i ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
    50220 iso-2022-jp ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
    50221 csISO2022JP ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)
    50222 iso-2022-jp ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)
    50225 iso-2022-kr ISO 2022 Korean
    50227 x-cp50227 ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)
    50229 ISO 2022 Traditional Chinese
    50930 EBCDIC Japanese (Katakana) Extended
    50931 EBCDIC US-Canada and Japanese
    50933 EBCDIC Korean Extended and Korean
    50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
    50936 EBCDIC Simplified Chinese
    50937 EBCDIC US-Canada and Traditional Chinese
    50939 EBCDIC Japanese (Latin) Extended and Japanese
    51932 euc-jp EUC Japanese
    51936 EUC-CN EUC Simplified Chinese; Chinese Simplified (EUC)
    51949 euc-kr EUC Korean
    51950 EUC Traditional Chinese
    52936 hz-gb-2312 HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
    54936 GB18030 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
    57002 x-iscii-de ISCII Devanagari
    57003 x-iscii-be ISCII Bangla
    57004 x-iscii-ta ISCII Tamil
    57005 x-iscii-te ISCII Telugu
    57006 x-iscii-as ISCII Assamese
    57007 x-iscii-or ISCII Odia
    57008 x-iscii-ka ISCII Kannada
    57009 x-iscii-ma ISCII Malayalam
    57010 x-iscii-gu ISCII Gujarati
    57011 x-iscii-pa ISCII Punjabi
    65000 utf-7 Unicode (UTF-7)
    65001 utf-8 Unicode (UTF-8)

     

    Code Pages

    enum CodePages
    {
        IBM037=37,
        IBM437=437,
        IBM500=500,
        ASMO_708=708,
        DOS_720=720,
        ibm737=737,
        ibm775=775,
        ibm850=850,
        ibm852=852,
        IBM855=855,
        ibm857=857,
        IBM00858=858,
        IBM860=860,
        ibm861=861,
        DOS_862=862,
        IBM863=863,
        IBM864=864,
        IBM865=865,
        cp866=866,
        ibm869=869,
        IBM870=870,
        windows_874=874,
        cp875=875,
        shift_jis=932,
        gb2312=936,
        ks_c_5601_1987=949,
        big5=950,
        IBM1026=1026,
        IBM01047=1047,
        IBM01140=1140,
        IBM01141=1141,
        IBM01142=1142,
        IBM01143=1143,
        IBM01144=1144,
        IBM01145=1145,
        IBM01146=1146,
        IBM01147=1147,
        IBM01148=1148,
        IBM01149=1149,
        utf_16=1200,
        unicodeFFFE=1201,
        windows_1250=1250,
        windows_1251=1251,
        windows_1252=1252,
        windows_1253=1253,
        windows_1254=1254,
        windows_1255=1255,
        windows_1256=1256,
        windows_1257=1257,
        windows_1258=1258,
        Johab=1361,
        macintosh=10000,
        x_mac_japanese=10001,
        x_mac_chinesetrad=10002,
        x_mac_korean=10003,
        x_mac_arabic=10004,
        x_mac_hebrew=10005,
        x_mac_greek=10006,
        x_mac_cyrillic=10007,
        x_mac_chinesesimp=10008,
        x_mac_romanian=10010,
        x_mac_ukrainian=10017,
        x_mac_thai=10021,
        x_mac_ce=10029,
        x_mac_icelandic=10079,
        x_mac_turkish=10081,
        x_mac_croatian=10082,
        utf_32=12000,
        utf_32BE=12001,
        x_Chinese_CNS=20000,
        x_cp20001=20001,
        x_Chinese_Eten=20002,
        x_cp20003=20003,
        x_cp20004=20004,
        x_cp20005=20005,
        x_IA5=20105,
        x_IA5_German=20106,
        x_IA5_Swedish=20107,
        x_IA5_Norwegian=20108,
        us_ascii=20127,
        x_cp20261=20261,
        x_cp20269=20269,
        IBM273=20273,
        IBM277=20277,
        IBM278=20278,
        IBM280=20280,
        IBM284=20284,
        IBM285=20285,
        IBM290=20290,
        IBM297=20297,
        IBM420=20420,
        IBM423=20423,
        IBM424=20424,
        x_EBCDIC_KoreanExtended=20833,
        IBM_Thai=20838,
        koi8_r=20866,
        IBM871=20871,
        IBM880=20880,
        IBM905=20905,
        IBM00924=20924,
        EUC_JP=20932,
        x_cp20936=20936,
        x_cp20949=20949,
        cp1025=21025,
        koi8_u=21866,
        iso_8859_1=28591,
        iso_8859_2=28592,
        iso_8859_3=28593,
        iso_8859_4=28594,
        iso_8859_5=28595,
        iso_8859_6=28596,
        iso_8859_7=28597,
        iso_8859_8=28598,
        iso_8859_9=28599,
        iso_8859_13=28603,
        iso_8859_15=28605,
        x_Europa=29001,
        iso_8859_8_i=38598,
        iso_2022_jp=50220,
        csISO2022JP=50221,
        iso_2022_kr=50225,
        x_cp50227=50227,
        euc_jp=51932,
        EUC_CN=51936,
        euc_kr=51949,
        hz_gb_2312=52936,
        GB18030=54936,
        x_iscii_de=57002,
        x_iscii_be=57003,
        x_iscii_ta=57004,
        x_iscii_te=57005,
        x_iscii_as=57006,
        x_iscii_or=57007,
        x_iscii_ka=57008,
        x_iscii_ma=57009,
        x_iscii_gu=57010,
        x_iscii_pa=57011,
        utf_7=65000,
        utf_8=65001
    };
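One way such identifiers might be used outside of .NET is sketched below in Python. The mapping covers only a hand-picked subset of the table, and the Python codec names on the right, as well as the names WINDOWS_TO_PYTHON and decode_by_identifier, are this sketch's own assumptions rather than anything defined by the table.

```python
import codecs

# Hand-picked subset of the identifiers listed above, mapped to the
# names of codecs that ship with Python's standard library.
WINDOWS_TO_PYTHON = {
    437: "cp437",
    932: "cp932",
    936: "cp936",
    949: "cp949",
    950: "cp950",
    1200: "utf-16-le",
    1201: "utf-16-be",
    1252: "cp1252",
    20127: "ascii",
    20866: "koi8-r",
    28591: "iso8859-1",
    65001: "utf-8",
}

def decode_by_identifier(data: bytes, identifier: int) -> str:
    """Decode bytes using the codec mapped to a code page identifier."""
    try:
        name = WINDOWS_TO_PYTHON[identifier]
    except KeyError:
        raise LookupError(f"no codec mapped for code page {identifier}") from None
    return data.decode(name)

# Every name in the mapping should resolve to an installed codec.
for codec_name in WINDOWS_TO_PYTHON.values():
    codecs.lookup(codec_name)

# The same character, stored under two different code pages.
assert decode_by_identifier(b"\xd6\xd0", 936) == decode_by_identifier(b"\xe4\xb8\xad", 65001)
```

In line with the note in the table's introduction, new data is usually better written as UTF-8 (65001), with the other identifiers reserved for reading legacy files.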

     

     

     

     

     
