前置知识
Character encodings: Essential concepts
UTF-8 uses 1 byte to represent characters in the ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.
UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.
UTF-32 uses 4 bytes for all characters.
In the following chart, the first line of numbers represents the position of a character in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.
A(Latin A.) | א(Hebrew alef.) | 好(Han ideograph AN.) | 𣎴(Chinese ideograph meaning ‘stump of tree’.) | |
---|---|---|---|---|
Code point | U+0041 | U+05D0 | U+597D | U+233B4 |
UTF-8 | 41 | D7 90 | E5 A5 BD | F0 A3 8E B4 |
UTF-16 | 00 41 | 05 D0 | 59 7D | D8 4C DF B4 |
UTF-32 | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4 |
乱码原因
以下 css 进行 sass 压缩
1 | a::before { |
压缩后(为了方便查看,对压缩后的代码进行了格式化)
1 | a::before { |
源代码 \4e2d
占用 5 个字节,压缩后代码 中
占用 3 个字节 E4 B8 AD
浏览器对压缩后的代码解析可能会出错,把 中
按 3 个 codepoint 解析,然后页面中就出现了三个 ascii 码
解决方案
css-unicode-loader
备注
1 | encodeURI('中') // '%E4%B8%AD' 三位 16 进制 |
sass 开发者一直强调他们把 css 压缩后,会在文件头声明 charset 或者 以 utf-8 with bom 格式保存文件,浏览器会正确处理,之所以导致这个问题,是因为后续的 css 处理(postcss) 把 charset 或 utf-8 with bom 格式丢掉了。
实际上 sass 的确是按 utf-8 with bom 保存的文件,后续的 css 处理也的确把 utf-8 with bom 格式丢掉了。但这问题和 bom 没关系。
The byte-order mark (BOM) in HTML