Encoding - Unicode etc.

Check and set local encoding

1️⃣ Check the local encoding of the current system

Windows (cmd / PowerShell)

chcp

The current Code Page will be displayed:

950: Traditional Chinese (Big5)
936: Simplified Chinese (GBK)
65001：UTF-8

Linux / macOS (Terminal)

locale

CheckLANGorLC_CTYPEvalue, for example:

LANG=zh_TW.UTF-8

---

2️⃣ Check the current encoding in C++ program

#include <clocale>
#include <iostream>

int main() {
    std::cout << "Current locale: " << std::setlocale(LC_ALL, nullptr) << std::endl;
}

In Windows it will usually show something likeCorChinese (Traditional)_Taiwan.950。

---

3️⃣ Set local encoding

Windows command prompt (cmd)

chcp 65001

→ Switch command line to UTF-8.

PowerShell

$OutputEncoding = [Console]::OutputEncoding = [Text.Encoding]::UTF8

Set within a C++ program

#include <clocale>

int main() {
    std::setlocale(LC_ALL, "zh_TW.UTF-8"); // Set to UTF-8
}

Or set up Big5 in Windows

std::setlocale(LC_ALL, "Chinese_Taiwan.950");

---

4️⃣ Recommended settings

If you want to interoperate with .NET/Web,It is recommended to use UTF-8 uniformly。
VS projects are available atProperties → Advanced → Raw File Character EncodingchooseUTF-8。

If the console output is garbled, it can be combined with:

SetConsoleOutputCP(65001); //Set the output to UTF-8
SetConsoleCP(65001); //Set input to UTF-8

Set permanent code

1️⃣ Problem background

usechcp 65001You can only temporarily change the character encoding of the current command prompt (cmd). Once the window is closed or restarted, the default value will be restored (e.g.950Big5). If you want the entire system and all applications to use UTF-8, you need to modify the "Regional Settings" at the Windows system level.

---

2️⃣ Permanently set the entire Windows to use UTF-8

Step 1: Open regional settings

turn onControl Panel
chooseClocks and regions → regions (Region)
switch tomanage(Administrative) Pagination
ClickChange system locale...

Step 2: Enable UTF-8

Check the bottom:
✅ Beta: Use Unicode UTF-8 for worldwide language support (Use Unicode UTF-8 for worldwide language support)
Press OK and restart the system

After restarting, the default locale of Windows Console, C++, .NET, Python and other programs will be UTF-8.

---

3️⃣ Verify whether it is effective

Verify in cmd

chcp

If displayed:

Active code page: 65001

This means UTF-8 has become the default.

Validate in C++

#include <clocale>
#include <iostream>

int main() {
    std::cout << "Current locale: " << std::setlocale(LC_ALL, nullptr) << std::endl;
}

---

4️⃣ Precautions

Some older software or drivers do not support UTF-8 and may cause garbled characters.
If compatibility issues arise, you can uncheck this box and return to Big5.
Modern tools such as VSCode, Visual Studio, and PowerShell fully support UTF-8.

---

5️⃣ Alternatives (without modifying the system)

If you do not want the entire system to be converted to UTF-8, you can set startup parameters or in-program settings for certain applications:

cmd /K chcp 65001

Or call within the program:

SetConsoleOutputCP(65001);
SetConsoleCP(65001);

Unicode escape sequences

Basic concepts

The Unicode escape sequence is a method of representing Unicode characters using pure ASCII characters. Commonly used in programming language source code, JSON, string constants and cross-platform data exchange. This notation is used when the environment cannot directly enter or display a specific character.

\u format

The most common format is\uXXXX,inXXXXis a 4-digit hexadecimal number, Represents a Unicode code point.

\u0041 → A
\u00E9 → é
\u4E2D→ medium

\U format

Supported by some languages (such as Python)\UXXXXXXXX, using 8 hexadecimal digits, All Unicode code points can be represented directly.

\U0001F600 → 😀

agent pair representation

In environments that only support 16-bit Unicode (such as the JavaScript legacy specification), exceedU+FFFFThe characters require a surrogate pair.

\uD83D\uDE00 → 😀

Common language examples

JavaScript


const s = "\u4E2D\u6587";

Python


s = "\u4E2D\u6587"
s2 = "\U0001F600"

JSON


{
  "text": "\u4E2D\u6587"
}

When to use

Avoid garbled characters caused by inconsistent source code encoding
Ensure the correct transmission of cross-system and cross-language data
Represent non-ASCII characters in an ASCII-only environment

URL Encoding

Basic concepts

URL Encoding (also known as Percent-Encoding) is a way of converting characters into a representation that is safe for use in URLs. URLs only allow certain ASCII characters, the rest must be converted to percent plus hexadecimal.

encoding format

The encoding format is%HH,inHHIs the hexadecimal representation of the byte value of this character. If characters occupy multiple bytes under UTF-8, they will be encoded separately.

blank →%20
! → %21
Medium →%E4%B8%AD

reserved characters

Some characters in URLs have special semantics and are called reserved characters. Whether encoding is required depends on where it is used.

Unreserved characters

The following characters can be used directly in URLs without encoding.

A–Z a–z
0–9
- _ . ~

Common language examples

JavaScript

encodeURIComponent("Chinese test")
decodeURIComponent("%E4%B8%AD%E6%96%87%20test")

Python

from urllib.parse import quote, unquote

quote("Chinese test")
unquote("%E4%B8%AD%E6%96%87%20test")

Difference from plus sign

existapplication/x-www-form-urlencodedIn the format, White space characters will be encoded as+, rather than%20. Still used in general URL paths%20。

When to use

URL contains non-ASCII characters
Pass query parameters to avoid syntax conflicts
Ensure consistent resolution across browsers and servers

Hexadecimal Escapes

Basic concepts

Hexadecimal Escapes are a way of using hexadecimal numbers to represent characters. Often used in string constants in programming languages to represent specific bytes or ASCII characters.

\x format

The most common format is\xHH,inHHis a 2-digit hexadecimal number, Represents a byte value, usually corresponding to ASCII or a single byte character.

\x41 → A
\x61 → a
\x0A→ line feed

Scope of application

Hexadecimal Escapes mostly only work on single tuples, If you use UTF-8 encoded multi-byte characters, you need to split them into multiple\xHH。

Medium (UTF-8) →\xE4\xB8\xAD

Common language support

C / C++


char c = '\x41';

JavaScript


const s = "\x48\x65\x6C\x6C\x6F";

Python


s = "\x48\x65\x6C\x6C\x6F"

Differences from Unicode Escapes

Hexadecimal Escapes in bytes
Unicode Escapes in Unicode code points
Hexadecimal Escapes are better suited for low-level data or ASCII
Unicode Escapes is more suitable for multilingual text

When to use

Requires precise control over byte content
Process binary data or communication protocols
Indicates a non-printable control character

ASCII encoding table

ASCII Hex correspondence table
	0x0	0x1	0x2	0x3	0x4	0x5	0x6	0x7	0x8	0x9	0xA	0xB	0xC	0xD	0xE	0xF
0x00	NUL	SOH	STX	ETX	EOT	ENQ	ACK	BEL	BS	HT	LF	VT	FF	CR	SO	SI
0x10	DLE	DC1	DC2	DC3	DC4	NAK	SYN	ETB	CAN	EM	SUB	ESC	FS	GS	RS	US
0x20	␣	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
0x30	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
0x40	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
0x50	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
0x60	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
0x70	p	q	r	s	t	u	v	w	x	y	z	{	\|	}	~	DEL
0x80	Ç	ü	é	â	ä	à	å	ç	ê	ë	è	ï	î	ì	Ä	Å
0x90	É	æ	Æ	ô	ö	ò	û	ù	ÿ	Ö	Ü	¢	£	¥	₧	ƒ
0xA0	á	í	ó	ú	ñ	Ñ	ª	º	¿	⌐	¬	½	¼	¡	«	»
0xB0	░	▒	▓	│	┤	╡	╢	╖	╕	╣	║	╗	╝	╜	╛	┐
0xC0	└	┴	┬	├	─	┼	╞	╟	╚	╔	╩	╦	╠	═	╬	╧
0xD0	╨	╤	╥	╙	╘	╒	╓	╫	╪	┘	┌	█	▄	▌	▐	▀
0xE0	α	ß	Γ	π	Σ	σ	µ	τ	Φ	Θ	Ω	δ	∞	φ	ε	∩
0xF0	≡	±	≥	≤	⌠	⌡	÷	≈	°	∙	·	√	ⁿ	²	■

Unicode range for all Chinese characters

Chinese characters in Unicode are mainly distributed in the following sections. The following lists the ranges of common Chinese characters (Hanzi) in the Unicode table, as well as detailed descriptions of each range.

Unicode range description

CJK Unified Ideographs (Chinese, Japanese and Korean Unified Ideographs):Mainly includes the most common Chinese characters.
CJK Unified Ideographs Extension A、B、C、D、E、F、G：This is a supplementary area that covers a wider range of Chinese characters, including ancient characters and a few less frequently used characters.
CJK Compatibility Ideographs:Contains characters that are compatible with other character systems and are often used for glyph compatibility requirements.

List of each range

scope name	Unicode range	illustrate
CJK Unified Ideographs	4E00–9FFF	Contains basic Chinese, Japanese and Korean characters, which is the most common Chinese character range.
CJK Unified Ideographs Extension A	3400–4DBF	Extended area A, containing less commonly used Chinese characters.
CJK Unified Ideographs Extension B	20000–2A6DF	Expanded area B mainly covers ancient characters and some rare Chinese characters.
CJK Unified Ideographs Extension C	2A700–2B73F	Area C has been expanded to further expand ancient characters and rare characters.
CJK Unified Ideographs Extension D	2B740–2B81F	Extended area D contains rarely used Chinese characters.
CJK Unified Ideographs Extension E	2B820–2CEAF	Expand area E, mainly adding more rare Chinese characters.
CJK Unified Ideographs Extension F	2CEB0–2EBEF	Expanded area F, including rarer ancient characters and Chinese characters.
CJK Unified Ideographs Extension G	30000–3134F	The extended G area is the latest added Chinese character area.
CJK Compatibility Ideographs	F900–FAFF	Compatibility zone for compatibility with older character set systems, such as different glyphs for Japanese glyphs.

Summarize

The range listed above includes most of the Chinese characters and is distributed in many different areas to meet different needs, including modern Chinese characters, ancient characters, and compatible characters. For Chinese font design or character analysis, these ranges provide complete font support.

Unicode Icons

UTF-8 character table

email: [email protected]

T:0000

資訊與搜尋 | 回dev首頁
email: Yan Sa [email protected] Line: 阿央

電話: 02-27566655 ,03-5924828

阿央
泱泱科技
捷昱科技泱泱企業

中文

Encoding - Unicode etc.

computer use

Check and set local encoding

1️⃣ Check the local encoding of the current system

Windows (cmd / PowerShell)

Linux / macOS (Terminal)

2️⃣ Check the current encoding in C++ program

3️⃣ Set local encoding

Windows command prompt (cmd)

PowerShell

Set within a C++ program

Or set up Big5 in Windows

4️⃣ Recommended settings

Set permanent code

1️⃣ Problem background

2️⃣ Permanently set the entire Windows to use UTF-8

Step 1: Open regional settings

Step 2: Enable UTF-8

3️⃣ Verify whether it is effective

Verify in cmd

Validate in C++

4️⃣ Precautions

5️⃣ Alternatives (without modifying the system)

Unicode escape sequences

Basic concepts

\u format

\U format

agent pair representation

Common language examples

When to use

URL Encoding

Basic concepts

encoding format

reserved characters

Unreserved characters

Common language examples

Difference from plus sign

When to use

Hexadecimal Escapes

Basic concepts

\x format

Scope of application

Common language support

Differences from Unicode Escapes

When to use

ASCII encoding table

Unicode range for all Chinese characters

Unicode range description

List of each range

Summarize

Unicode Icons