Logo C++ Online Judge

C++

时间限制:1 s 空间限制:512 MB
TA: 苏凯
统计

Computers use number to represent characters. For example, (int)'0'==48 in C/C++. This is the ASCII standard for mapping characters into numbers, which is practically implemented by every computer system today.

We only have 26 letters in the English dictionary, but there are much more for Chinese, Japanese, Korean, and other languages. ASCII is not enough. We can define our custom mapping to include all characters we want. Historically, there was GB2312, GBK, etc. for Chinese, but the de facto coding today is Unicode, which includes characters from almost all languages with some emoji extensions.

If you are seeing 锟斤拷�� or something similar, you might be using a wrong encoding. It shows up weirdly because the numbers in one encoding system is decoded in another system. For your reference, Windows cmd.exe refer GBK as cp936, and Unicode as cp65001.

Unicode use numbers from 0 to 0x10FFFF. However, when we read from a file, the smallest unit is byte. A byte is 8 bits long and represent at most $2^8=256$ different possibilities, which is way smaller than what Unicode needs. We can combine every 4 bytes to get a number in range $[0,2^{32})$. This is how we derive UTF-32.

When we switch from ASCII to Unicode, what took 1 byte now takes 4 bytes. This increases the file size a lot. We observe that many characters are not used very often. Can we use less bytes for frequently used characters, and read more bytes when it's not enough? Here's UTF-16:

  • Read 2 bytes at a time and convert it to a number in 0x0000 to 0xFFFF
  • If the number is in 0xD800–0xDBFF, it's a high surrogate providing 10 bits.
    • The next number must be in 0xDC00–0xDFFF, providing another 10 bits.
    • The high surrogate and the low surrogate form a surrogate pair.
    • They together form an integer of 20 bits, which is enough for Unicode range 0x010000-0x10FFFF.
      U' = yyyyyyyyyyxxxxxxxxxx  // U - 0x10000
      H  = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
      L  = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx
  • Otherwise, use the number directly as a code point.

You are required to decode a UTF-16BE file manually. "BE" means big endian, or "largest bit come first". When you read two bytes into an array a, a[0] is the first byte and you reconstruct the number like this: (int(a[0])<<8)+a[1].

You need to read raw bytes instead of scanf. Here's how:

#include <stdio.h>
#include <stdlib.h>

int main() {
    size_t max_num_bytes = 1024;  // Number of bytes to read
    char *buffer = malloc(max_num_bytes);  // Allocate memory for the buffer
    size_t bytesRead = fread(buffer, 1, max_num_bytes, stdin); // bytesRead is the actual number of bytes read

    /* Your code here */
}

Example Input

🙌👐👌☝️ (Don't copy this, download the example file!)

Example Output

128588 128080 128076 9757 65039

Utilities

You can use VSCode to view the downloaded file. It will prompt you that it cannot identify the encoding (because there isn't a BOM in the example file). We can change it at the bottom-right corner. Click on "UTF-8", "Reopen with Encoding", and finally "UTF-16 BE".

  • Unicode Home of the Unicode Consortium
  • Emojipedia Search for Unicode emojis😊
  • Compart An easy to use Unicode database

Optional Reading for Fun

We can see that some characters take 2 bytes, and some characters take 4 bytes. In today's modern systems, we use a even more compact encoding UTF-8, which uses 1 to 4 bytes. It is fully compatible with ASCII, meaning that if a piece of data is encoded using ASCII, it is valid UTF-8. However, such variable length encoding makes it difficult to find the i-th character in a string, because its address depends on previous contents. Some systems using UTF-16, e.g. Java/JavaScript, could cause problems dealing with characters in the range 0x010000-0x10FFFF. A direct consequence is that in some web editors, an emoji 🚀 is counted as two characters and you might move the cursor into the middle of the character! Python solves this by automatically switching to UTF-32 when needed, at the cost of larger memory space. It also distincts between strings, which are always Unicode, and bytes, which are always 8 bits each. For invalid UTF-8 strings, Python can use the low surrogate to represent the raw byte, unifying the interface for file systems on all platforms. Details like this make Python slow, but easy to use.