Encoding - Unicode etc.





Check and set local encoding

1️⃣ Check the local encoding of the current system

Windows (cmd / PowerShell)

chcp

The active code page is displayed, for example 950 (Big5) or 65001 (UTF-8).

Linux / macOS (Terminal)

locale

Check the LANG or LC_CTYPE value, for example:

LANG=zh_TW.UTF-8
---

2️⃣ Check the current encoding in C++ program

#include <clocale>
#include <iostream>

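int main() {
    // Passing nullptr only queries the current locale; it does not change it.
    std::cout << "Current locale: " << std::setlocale(LC_ALL, nullptr) << std::endl;
}

A program starts in the C locale, so this prints C unless std::setlocale(LC_ALL, "") has been called first to adopt the system locale. On a Traditional Chinese Windows system, the system locale usually looks like Chinese (Traditional)_Taiwan.950.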

---

3️⃣ Set local encoding

Windows command prompt (cmd)

chcp 65001

→ Switch command line to UTF-8.

PowerShell

$OutputEncoding = [Console]::OutputEncoding = [Text.Encoding]::UTF8

Set within a C++ program

#include <clocale>

int main() {
    std::setlocale(LC_ALL, "zh_TW.UTF-8"); // Set to UTF-8
}

Or select Big5 on Windows:

std::setlocale(LC_ALL, "Chinese_Taiwan.950");
---

4️⃣ Recommended settings



Set the encoding permanently

1️⃣ Problem background

Running chcp 65001 only changes the code page of the current command prompt (cmd) window temporarily. Once the window is closed or a new one is opened, the default value is restored (e.g. 950, Big5). To make the entire system and all applications use UTF-8, change the "Regional Settings" at the Windows system level.

---

2️⃣ Permanently set the entire Windows to use UTF-8

Step 1: Open regional settings

  1. Open Control Panel
  2. Choose Clock and Region → Region
  3. Switch to the Administrative tab
  4. Click Change system locale...

Step 2: Enable UTF-8

  1. Check the option at the bottom:
    ✅ Beta: Use Unicode UTF-8 for worldwide language support
  2. Press OK and restart the system

After restarting, the default code page seen by the Windows console, C++, .NET, Python and other programs will be UTF-8 (65001).

---

3️⃣ Verify that it took effect

Verify in cmd

chcp

If displayed:

Active code page: 65001

This means UTF-8 has become the default.

Validate in C++

#include <clocale>
#include <iostream>

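int main() {
    // Adopt the system default locale first; without this call the program
    // stays in the "C" locale and the query below always prints "C".
    std::setlocale(LC_ALL, "");
    std::cout << "Current locale: " << std::setlocale(LC_ALL, nullptr) << std::endl;
}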
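
Since Python is one of the programs affected by the system setting, a comparable check can be sketched in Python. This is only an illustration, not part of the original steps; the helper name describe_encodings is hypothetical, and the values it reports depend on the machine it runs on.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python reports for the locale and for stdout."""
    return {
        # Encoding Python prefers for text I/O under the current locale.
        "preferred": locale.getpreferredencoding(False),
        # Encoding of the standard output stream (None if it has none).
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```

On a system switched to UTF-8, both values would be expected to name UTF-8 rather than a legacy code page such as cp950.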
---

4️⃣ Precautions

---

5️⃣ Alternatives (without modifying the system)

If you do not want the entire system to be converted to UTF-8, you can set startup parameters or in-program settings for certain applications:

cmd /K chcp 65001
Or call the Win32 console functions (declared in <windows.h>) within the program:
SetConsoleOutputCP(65001);
SetConsoleCP(65001);


Unicode escape sequences

Basic concepts

A Unicode escape sequence represents a Unicode character using only ASCII characters. It is common in programming language source code, JSON, string constants and cross-platform data exchange, and is used when the environment cannot directly enter or display a specific character.

\u format

The most common format is \uXXXX, where XXXX is a 4-digit hexadecimal number that represents a Unicode code point (U+0000 to U+FFFF).

\U format

Some languages (such as Python) support \UXXXXXXXX, which uses 8 hexadecimal digits and can represent every Unicode code point directly.
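
To see both formats side by side, here is a small Python sketch that uses the built-in unicode_escape codec to turn text into escape sequences and back. The variable names are illustrative only.

```python
# "\u4E2D\u6587" fits the 4-digit form; "\U0001F600" needs the 8-digit form.
text = "\u4E2D\u6587\U0001F600"

# Encode to an ASCII-only string of escape sequences.
escaped = text.encode("unicode_escape").decode("ascii")
print(escaped)  # \u4e2d\u6587\U0001f600

# Decode the escape sequences back into the original characters.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored == text)  # True
```

Python picks \u for code points up to U+FFFF and \U for anything above.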

Surrogate pair representation

In environments whose escapes only cover 16 bits (such as \uXXXX in the legacy JavaScript specification), characters above U+FFFF must be written as a surrogate pair: two consecutive \uXXXX escapes.
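
The arithmetic behind a surrogate pair is short enough to sketch. The function name to_surrogate_pair below is hypothetical; the constants are the standard UTF-16 ones (0x10000 offset, high surrogates from 0xD800, low surrogates from 0xDC00).

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"\\u{high:04X}\\u{low:04X}")  # \uD83D\uDE00
```

So U+1F600 is written as "\uD83D\uDE00" in an environment limited to 4-digit escapes.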

Common language examples

JavaScript


const s = "\u4E2D\u6587";

Python


s = "\u4E2D\u6587"
s2 = "\U0001F600"

JSON


{
  "text": "\u4E2D\u6587"
}
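
JSON escapes can also be produced and parsed programmatically. The sketch below uses Python's standard json module, which escapes non-ASCII characters by default (ensure_ascii=True) and writes characters above U+FFFF as surrogate pairs, because JSON only has the 4-digit \u form.

```python
import json

# Non-ASCII text is written as \uXXXX escapes by default.
encoded = json.dumps({"text": "\u4E2D\u6587"})
print(encoded)  # {"text": "\u4e2d\u6587"}

# Parsing turns the escapes back into the original characters.
decoded = json.loads(encoded)
print(decoded["text"] == "\u4E2D\u6587")  # True

# A character above U+FFFF becomes a surrogate pair in JSON.
emoji_json = json.dumps("\U0001F600")
print(emoji_json)  # "\ud83d\ude00"
```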

When to use



URL Encoding

Basic concepts

URL Encoding (also known as Percent-Encoding) converts characters into a representation that is safe for use in URLs. URLs only allow certain ASCII characters; everything else must be written as a percent sign followed by two hexadecimal digits.

Encoding format

The encoding format is %HH, where HH is the hexadecimal representation of one byte of the character. A character that occupies multiple bytes in UTF-8 is encoded byte by byte.
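
The byte-by-byte rule can be illustrated with a few lines of Python. The helper percent_encode is a hypothetical, deliberately naive version that escapes every byte; the standard library's quote is shown next to it for comparison.

```python
from urllib.parse import quote


def percent_encode(text):
    """Percent-encode every UTF-8 byte of text (nothing is left unescaped)."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


# U+4E2D occupies three bytes in UTF-8 (E4 B8 AD), so it becomes three %HH groups.
print(percent_encode("\u4E2D"))  # %E4%B8%AD
print(quote("\u4E2D"))           # %E4%B8%AD
```

Unlike this sketch, quote leaves characters that are already URL-safe untouched.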

Reserved characters

Some characters have a special meaning in URLs and are called reserved characters (for example : / ? # & =). Whether they must be encoded depends on where in the URL they appear.

Unreserved characters

The following characters can be used directly in URLs without encoding: the letters A-Z and a-z, the digits 0-9, and the four symbols - . _ ~

Common language examples

JavaScript

encodeURIComponent("中文 test")                  // "%E4%B8%AD%E6%96%87%20test"
decodeURIComponent("%E4%B8%AD%E6%96%87%20test")  // "中文 test"

Python

from urllib.parse import quote, unquote

quote("中文 test")                    # '%E4%B8%AD%E6%96%87%20test'
unquote("%E4%B8%AD%E6%96%87%20test")  # '中文 test'

Difference from plus sign

In the application/x-www-form-urlencoded format, a space is encoded as + rather than %20. General URL paths still use %20.
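
The difference is easy to see with Python's urllib.parse, which offers one function for each convention. This is a minimal sketch; the sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus, urlencode

# Path-style encoding: a space becomes %20.
path_style = quote("a b")
print(path_style)  # a%20b

# Form-style encoding: a space becomes +.
form_style = quote_plus("a b")
print(form_style)  # a+b

# urlencode builds a form-encoded query string, so it also uses +.
query = urlencode({"q": "a b"})
print(query)  # q=a+b

# A literal plus sign must itself be escaped so it is not read as a space.
literal_plus = quote_plus("1+1")
print(literal_plus)  # 1%2B1
```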

When to use



Hexadecimal Escapes

Basic concepts

Hexadecimal Escapes are a way of using hexadecimal numbers to represent characters. Often used in string constants in programming languages to represent specific bytes or ASCII characters.

\x format

The most common format is \xHH, where HH is a 2-digit hexadecimal number that represents one byte value, usually an ASCII or other single-byte character.

Scope of application

Hexadecimal escapes mostly work on single bytes. A multi-byte UTF-8 character has to be split into several \xHH escapes, one per byte.
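
A short Python sketch of the multi-byte case: in a bytes literal each \xHH is one byte, so a UTF-8 character is spelled out byte by byte and then decoded. Note that in a Python str literal, \xHH means the code point U+00HH rather than a raw byte.

```python
# Six bytes: the UTF-8 encodings of U+4E2D (E4 B8 AD) and U+6587 (E6 96 87).
raw = b"\xE4\xB8\xAD\xE6\x96\x87"
text = raw.decode("utf-8")
print(len(raw), len(text))        # 6 2
print(text == "\u4E2D\u6587")     # True

# In a str literal, \x41 is simply the character with code 0x41.
print("\x41")                     # A

# "\xE4" in a str is U+00E4, which itself needs two bytes in UTF-8.
print("\xE4".encode("utf-8"))     # b'\xc3\xa4'
```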

Common language support

C / C++


char c = '\x41';

JavaScript


const s = "\x48\x65\x6C\x6C\x6F";

Python


s = "\x48\x65\x6C\x6C\x6F"

Differences from Unicode Escapes

When to use



ASCII encoding table

ASCII Hex correspondence table
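      0x0 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8 0x9 0xA 0xB 0xC 0xD 0xE 0xF
0x00  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
0x10  DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
0x20  SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
0x30  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
0x40  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
0x50  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
0x60  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
0x70  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
0x80  Ç   ü   é   â   ä   à   å   ç   ê   ë   è   ï   î   ì   Ä   Å
0x90  É   æ   Æ   ô   ö   ò   û   ù   ÿ   Ö   Ü   ¢   £   ¥       ƒ
0xA0  á   í   ó   ú   ñ   Ñ   ª   º   ¿       ¬   ½   ¼   ¡   «   »
0xB0
0xC0
0xD0
0xE0  α   ß   Γ   π   Σ   σ   µ   τ   Φ   Θ   Ω   δ       φ   ε
0xF0      ±                   ÷       °       ·           ²

Add the row value and the column value to get the code of a character, for example 0x40 + 0x1 = 0x41 for A. ASCII itself only defines 0x00 to 0x7F; the rows from 0x80 to 0xF0 are extended characters that follow code page 437.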
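
A table lookup of this kind can be sketched in Python. The helper ascii_at is hypothetical. The last lines assume the rows from 0x80 upward follow code page 437, which the characters shown there match; ASCII itself stops at 0x7F.

```python
def ascii_at(row, column):
    """Return the ASCII character at a table cell, e.g. row 0x40, column 0x1."""
    code = row + column
    if not 0 <= code <= 0x7F:
        raise ValueError("ASCII only defines codes 0x00 to 0x7F")
    return chr(code)


print(ascii_at(0x40, 0x1))   # A
print(hex(ord("a")))         # 0x61

# Bytes from 0x80 upward are not ASCII; decoding them needs a code page.
upper = bytes([0x80]).decode("cp437")
print(upper == "\u00C7")     # True
```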


Unicode range for all Chinese characters

Chinese characters (Hanzi) in Unicode are spread over several blocks. The common ranges are listed below, with a short description of each.

Unicode range description

List of each range

Range name                          Unicode range  Description
CJK Unified Ideographs              4E00–9FFF      Basic Chinese, Japanese and Korean characters; the most common Chinese character range.
CJK Unified Ideographs Extension A  3400–4DBF      Extension A: less commonly used Chinese characters.
CJK Unified Ideographs Extension B  20000–2A6DF    Extension B: mainly ancient characters and some rare Chinese characters.
CJK Unified Ideographs Extension C  2A700–2B73F    Extension C: further ancient and rare characters.
CJK Unified Ideographs Extension D  2B740–2B81F    Extension D: rarely used Chinese characters.
CJK Unified Ideographs Extension E  2B820–2CEAF    Extension E: mainly additional rare Chinese characters.
CJK Unified Ideographs Extension F  2CEB0–2EBEF    Extension F: rarer ancient characters and Chinese characters.
CJK Unified Ideographs Extension G  30000–3134F    Extension G: added in Unicode 13.0; later Unicode versions add further extension blocks.
CJK Compatibility Ideographs        F900–FAFF      Compatibility block for older character sets, such as variant glyph forms used in Japanese.

Summary

The ranges listed above cover most Chinese characters. They are spread over several blocks that serve different needs: modern characters, ancient characters and compatibility characters. For Chinese font design or character analysis, these ranges identify the code points that need to be covered.
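
For character analysis, the ranges can be turned into a simple lookup. The following Python sketch copies the ranges from the list above; the names CJK_RANGES and cjk_block are hypothetical.

```python
# (block name, first code point, last code point), taken from the list above.
CJK_RANGES = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
    ("CJK Unified Ideographs Extension C", 0x2A700, 0x2B73F),
    ("CJK Unified Ideographs Extension D", 0x2B740, 0x2B81F),
    ("CJK Unified Ideographs Extension E", 0x2B820, 0x2CEAF),
    ("CJK Unified Ideographs Extension F", 0x2CEB0, 0x2EBEF),
    ("CJK Unified Ideographs Extension G", 0x30000, 0x3134F),
    ("CJK Compatibility Ideographs", 0xF900, 0xFAFF),
]


def cjk_block(char):
    """Return the name of the listed block containing char, or None."""
    code_point = ord(char)
    for name, start, end in CJK_RANGES:
        if start <= code_point <= end:
            return name
    return None


print(cjk_block("\u4E2D"))      # CJK Unified Ideographs
print(cjk_block("\U00020000"))  # CJK Unified Ideographs Extension B
print(cjk_block("A"))           # None
```

A string can then be scanned with a comprehension such as [c for c in s if cjk_block(c)] to pick out the Chinese characters.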

Unicode Icons



  • UTF-8 character table

