HOW TO DETERMINE UTF-8, UNICODE, AND GBK ENCODING

9 answers

Anonymous users2024-02-06

UTF-8 (8-bit Unicode Transformation Format) is a convention for encoding a wide range of characters.

It is the standard for character identification and a reference for various programming languages and devices, helping to standardize the display of letters, numbers, and other characters.

In many cases, UTF-8 replaces an old convention called the American Standard Code for Information Interchange (ASCII).

ASCII handles all the characters required for English language text, but UTF-8 handles more diverse sets of symbols for other languages that don't use the English or Roman alphabet. UTF-8 is considered backwards compatible with ASCII.
Anonymous users2024-02-05

UTF-8 is a variable-length encoding. If a character is encoded as a single byte, the highest bit of that byte is 0. For a multi-byte character, the number of consecutive 1 bits at the top of the first byte gives the number of bytes in the sequence, and each of the remaining bytes starts with 10. The original design allowed up to 6 bytes per character; the current standard (RFC 3629) limits it to 4.
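The bit patterns described above are enough to check whether a byte string at least looks like UTF-8, which is one practical way to tell it apart from GBK. The following is a minimal sketch, not a full validator: the function names are made up for illustration, it only accepts the 1 to 4 byte forms, and it does not reject overlong forms or surrogates.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if the byte
    cannot start a sequence (a 10xxxxxx continuation byte or an obsolete
    5/6-byte lead)."""
    if (lead_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx
    if (lead_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (lead_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (lead_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    return 0


def follows_utf8_pattern(data: bytes) -> bool:
    """Check only the lead-byte / continuation-byte structure of the data."""
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0 or i + n > len(data):
            return False
        # Every byte after the lead byte must look like 10xxxxxx.
        if any((b & 0b1100_0000) != 0b1000_0000 for b in data[i + 1:i + n]):
            return False
        i += n
    return True


print(follows_utf8_pattern("中文".encode("utf-8")))  # True
print(follows_utf8_pattern("中文".encode("gbk")))    # False
```

In everyday code, trying `data.decode("utf-8")` and catching `UnicodeDecodeError` does the same job with stricter checks.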
Anonymous users2024-02-04

First, the subject is different.

1. GBK: Formulated by the National Information Technology Standardization Technical Committee of the People's Republic of China on December 1, 1995, and jointly issued on December 15, 1995 by the Standardization Department of the State Bureau of Technical Supervision and the Department of Science and Technology and Quality Supervision of the Ministry of Electronics Industry, as Technical Supervision Standards Letter [1995] No. 229.

2. GB2312: Based on the "Chinese Character Encoding Character Set for Information Exchange: Basic Set" issued in 1980. It is the Chinese national standard for Chinese information processing and is a mandatory Chinese encoding.

Second, the characteristics are different.

1. GBK: Backward compatible with the GB 2312 encoding and forward-looking in its support for ISO international standards; it is a product of the transition from the former to the latter.

2. GB2312: Includes a total of 6763 simplified Chinese characters and 682 symbols. The Chinese characters consist of 3755 first-level characters, sorted by pinyin, and 3008 second-level characters, sorted by radical.

The formulation and application of this standard have played a great role in standardizing and promoting the process of Chinese informatization.

Third, the number of bytes is different.

1. GBK: An internal code extension specification based on the GB2312-80 standard. It uses a double-byte encoding scheme ranging from 8140 to FEFE (excluding xx7F), with a total of 23940 code points and 21003 Chinese characters, and is fully compatible with GB2312-80.

2. GB2312: A double-byte encoding; each Chinese character takes 2 bytes. The description of single-byte, double-byte, and four-byte characters totaling more than 28,000 characters matches the later GB 18030 standard rather than GB2312.
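The 23940 figure for GBK can be checked from the range quoted above. A small sketch, assuming the first byte runs from 0x81 to 0xFE and the second byte runs from 0x40 to 0xFE with 0x7F skipped (the function name is made up for illustration):

```python
def count_gbk_code_points() -> int:
    """Count double-byte positions in 8140..FEFE, excluding xx7F."""
    lead_bytes = range(0x81, 0xFE + 1)                              # 126 values
    trail_bytes = [b for b in range(0x40, 0xFE + 1) if b != 0x7F]   # 190 values
    return len(lead_bytes) * len(trail_bytes)


print(count_gbk_code_points())  # 126 * 190 = 23940
```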
Anonymous users2024-02-03

Bookmarking this for later. The original poster should understand it now.
Anonymous users2024-02-02

The differences between the two are as follows:

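Unicode assigns each character a number (its code point), usually written in hexadecimal. Computers only understand binary, so storing text strictly the Unicode way (UCS-2) means writing each character's number as a fixed-width two-byte binary value.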

UTF-8 works as follows. For a single-byte character, the first bit of the byte is set to 0; for English text, the UTF-8 code occupies only one byte and is exactly the same as the ASCII code. For a character that takes n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0, and the first two bits of each following byte are set to 10. The remaining bits across the n bytes are filled with the character's Unicode code point, padded with leading 0s.
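The rule above can be written out by hand and compared with Python's built-in encoder. This is a sketch for illustration only: `encode_utf8` is a made-up name, and it covers just the 1 to 4 byte forms used by the current standard.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point following the bit layout described above."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0b1100_0000 | (code_point >> 6),
                      0b1000_0000 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b1110_0000 | (code_point >> 12),
                      0b1000_0000 | ((code_point >> 6) & 0x3F),
                      0b1000_0000 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b1111_0000 | (code_point >> 18),
                  0b1000_0000 | ((code_point >> 12) & 0x3F),
                  0b1000_0000 | ((code_point >> 6) & 0x3F),
                  0b1000_0000 | (code_point & 0x3F)])


print(encode_utf8(ord("中")).hex())                         # e4b8ad
print(encode_utf8(ord("中")) == "中".encode("utf-8"))        # True
```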

1. Introduction to Unicode:

Unicode (also called Universal Code or Single Code) is an industry standard in the field of computer science, covering character sets, encoding schemes, and so on. Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a unified and unique binary code to each character in each language, to meet the requirements of cross-language and cross-platform text conversion and processing. R&D began in 1990 and it was officially announced in 1994.

2. Introduction to UTF-8:

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and is also a prefix code. It was created in 1992 by Ken Thompson. It can represent any character in the Unicode standard, and the first byte of its encoding is still ASCII-compatible, so software originally written to handle ASCII characters can continue to be used with little or no modification.

As a result, it is becoming the preferred encoding for e-mail, web pages, and other applications that store or transmit text.
Anonymous users2024-02-01

I drew up a rough mind map of the relationship between Unicode and UTF-8; part of it is excerpted here.

UTF-8 encodes a Unicode character into a variable number of bytes depending on its code point: 1 to 6 in the original design, at most 4 under the current standard. Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 or more. If you want to transfer text that is mostly English, UTF-8 saves space.
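The byte counts are easy to confirm in Python. A short sketch (the helper name `utf8_length` and the sample characters are chosen only for illustration):

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = [
    "A",           # English letter
    "\u00e9",      # e with acute accent
    "\u4e2d",      # the Chinese character zhong
    "\U0001F600",  # an emoji outside the Basic Multilingual Plane
]
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```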

UTF is an abbreviation for Unicode Transformation Format, which means converting Unicode characters into a certain format. The UTF family of encoding schemes (UTF-8, UTF-16, UTF-32) are all derived from Unicode to suit different data storage or delivery needs, and each can represent every character in the Unicode standard. Of these, UTF-8 is by far the most widely used for files and network transfer, while UTF-16 and UTF-32 are much less common there.

As the above also shows, UTF-8 has an additional benefit: ASCII can be seen as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work with UTF-8 text.

In the computer memory, Unicode encoding is uniformly used, and when it needs to be saved to hard disk or needs to be transferred, it is converted to UTF-8 encoding.
Anonymous users2024-01-31

We all know that computers use 0s and 1s to store text. For example, if the character C is stored as 01000011, then the computer goes through two steps to display it:

1. The computer reads 01000011 and gets the number 67 because 67 is encoded as 01000011.

2. The computer looks for 67 in the Unicode character set and finds C.

Similarly, in the other direction:

1. My computer maps C to 67 in the Unicode character set.

2. My computer encodes 67 into 01000011 and sends it to the web server.
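Both directions of the steps above can be reproduced in a few lines. A minimal sketch, assuming a single-byte character (the function names are made up for illustration):

```python
def char_to_bits(ch: str) -> str:
    """Character -> code point -> 8-bit binary string."""
    code_point = ord(ch)              # 'C' maps to 67
    return format(code_point, "08b")  # 67 is written as 01000011


def bits_to_char(bits: str) -> str:
    """8-bit binary string -> number -> character."""
    code_point = int(bits, 2)         # 01000011 is read as 67
    return chr(code_point)            # 67 is looked up as 'C'


print(char_to_bits("C"))         # 01000011
print(bits_to_char("01000011"))  # C
```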

Almost all web applications use the Unicode character set because there's no reason to use a different character set.

The Unicode code space has room for more than a million code points. The simplest encoding is UTF-32, which uses 32 bits per character. This is the easiest way to do it, because computers have long treated 32 bits as a number, and computers are best at processing numbers.

But the problem is that this is too wasteful of space.

UTF-8 saves space. In UTF-8, the character C needs only 8 bits, while some less commonly used characters need 32 bits. Other characters may use 16 or 24 bits. An article like this one, if encoded in UTF-8, takes up only about a quarter of the space of UTF-32.
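The "about a quarter" estimate can be checked for English text. A small sketch; `encoded_sizes` is a made-up helper, and the `utf-32-le` codec is used so that no byte order mark is counted:

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 size, UTF-32 size) in bytes for the given text."""
    utf8_size = len(text.encode("utf-8"))
    utf32_size = len(text.encode("utf-32-le"))  # always 4 bytes per character
    return utf8_size, utf32_size


print(encoded_sizes("An article like this one"))  # (24, 96)
```

For text that is mostly Chinese the saving is smaller, since each character takes 3 bytes in UTF-8 against 4 in UTF-32.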
Anonymous users2024-01-30

First, the subject is different.

1. GB2312: The name of a character encoding for Simplified Chinese.

2. UTF-8: It is a variable-length character encoding for Unicode.

3. ISO-8859-1: It is a single-byte encoding, backward compatible with ASCII, and its encoding range is 0x00-0xff, which is completely consistent with ASCII between 0x00 and 0x7f.
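The ISO-8859-1 properties listed above have a practical consequence when trying to determine an encoding: every possible byte is valid ISO-8859-1, so decoding with it never fails and cannot confirm anything. A brief sketch using Python's built-in codec (the function names are made up for illustration):

```python
def latin1_decodes_every_byte() -> bool:
    """Each byte 0x00..0xFF maps to the code point with the same number."""
    text = bytes(range(256)).decode("iso-8859-1")
    return [ord(ch) for ch in text] == list(range(256))


def latin1_matches_ascii_below_0x80() -> bool:
    """In 0x00..0x7F, ISO-8859-1 and ASCII decode to identical text."""
    low = bytes(range(0x80))
    return low.decode("iso-8859-1") == low.decode("ascii")


print(latin1_decodes_every_byte())        # True
print(latin1_matches_ascii_below_0x80())  # True
```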

Second, the characteristics are different.

1. GB2312: Based on the "Chinese Character Encoding Character Set for Information Exchange: Basic Set" issued in 1980. It is the Chinese national standard for Chinese information processing and is a mandatory Chinese encoding.

2. UTF-8: It can represent any character in the Unicode standard, and the first byte of its encoding is still compatible with ASCII, so software originally written to process ASCII characters can continue to be used with little or no modification.

3. ISO-8859-1: In addition to the characters included in ASCII, it includes the letters and symbols needed for Western European languages. Greek, Thai, Arabic, and Hebrew are covered by other parts of the ISO 8859 family, not by ISO-8859-1 itself.

Third, the role is different.

1. GB2312: The formulation and application have played a great role in standardizing and promoting the process of Chinese informatization.

2. UTF-8: It has gradually become the preferred encoding for e-mails, web pages, and other applications that store or transmit text.

3. ISO-8859-1: Most of its symbols can be used directly without entity references, but entity names or entity numbers provide a way to express symbols that are not easy to type on a keyboard.
Anonymous users2024-01-29

In summary: GB2312 is a Chinese national standard, while UTF-8 is an international one. GB2312 contains only Chinese characters and a limited set of foreign letters and symbols, while UTF-8 can encode the characters of every language covered by Unicode.

We know that computers can't store Chinese characters directly, so the characters must be encoded. GB2312 stores a Chinese character in 2 bytes, while UTF-8 needs 3 bytes for most Chinese characters (4 bytes only for rare ones outside the Basic Multilingual Plane).

The different ANSI coding standards developed by each country and region specify only the "characters" required for their own languages. For example, the Chinese standard (GB2312) does not specify how Korean characters are stored.
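Both points can be tried with Python's built-in codecs: a common Chinese character takes 2 bytes in GB2312 and 3 bytes in UTF-8, and a Korean character has no GB2312 code at all. A minimal sketch (`encoded_length` is a made-up helper for illustration):

```python
from typing import Optional


def encoded_length(ch: str, encoding: str) -> Optional[int]:
    """Byte length of ch in the given encoding, or None if it has no code there."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


print(encoded_length("中", "gb2312"))  # 2
print(encoded_length("中", "utf-8"))   # 3
print(encoded_length("한", "gb2312"))  # None
print(encoded_length("한", "utf-8"))   # 3
```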

What these ANSI coding standards define has two meanings:

1. Which characters to use: that is, which Chinese characters, letters, and symbols are included in the standard. The set of "characters" it contains is called a "character set".

2. How to store them: whether each "character" is stored in one byte or several, and which byte values are used. This rule is called the "encoding".

When countries and regions formulate coding standards, the "collection of characters" and the "encoding" are generally defined at the same time. Therefore, what we usually call "character sets", such as GB2312, GBK, JIS, and so on, mean not only "a collection of characters" but also an "encoding".

When designing a program, choose the data encoding to suit the application. For example, if you need to sort Chinese character fields by pinyin, GBK encoding (a superset of GB2312) is a common choice, because the commonly used first-level GB2312 characters are laid out in pinyin order.
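The pinyin-sorting point can be illustrated by sorting on GBK bytes instead of Unicode code points. This is only a rough sketch under stated assumptions: `sort_by_gbk_bytes` is a made-up helper, the sample characters are all first-level GB2312 characters, and the result is not a true pinyin sort for second-level characters (ordered by radical) or for characters with several readings.

```python
def sort_by_gbk_bytes(words):
    """Sort strings by their GBK byte sequences."""
    return sorted(words, key=lambda w: w.encode("gbk"))


words = ["中", "国", "文", "你", "好"]

by_gbk = sort_by_gbk_bytes(words)
# ['国', '好', '你', '文', '中']  (guo, hao, ni, wen, zhong)

by_code_point = sorted(words)
# ['中', '你', '国', '好', '文']  (Unicode order, unrelated to pinyin)
```

For production use, a proper collation library is a safer choice than relying on byte order.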