Assignment Overview
This question requires the use of μC++, which means compiling the program with the u++ command and replacing routine main with member uMain::main.
Write a semi-coroutine to verify a string of bytes is a valid Unicode Transformation Format 8-bit character (UTF-8). UTF-8 allows any universal character to be represented while maintaining full backwards-compatibility with ASCII encoding, which is achieved by using a variable-length encoding. The following table provides a summary of the Unicode value ranges in hexadecimal, and how they are represented in binary for UTF-8.
Unicode ranges UTF-8 binary encoding
000000-00007F 0xxxxxxx
000080-0007FF 110xxxxx 10xxxxxx
000800-00FFFF 1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
(UTF-8 is restricted to U+10FFFF so itmatches the constraints of the UTF-16 character encoding.)
For example, the symbol £ is represented by Unicode value 0xA3 (binary 1010 0011). Since £ falls within the range of 0x80 to 0x7FF, it is encoded by the UTF-8 bit string 110xxxxx 10xxxxxx. To fit the character into the eleven bits of the UTF-8 encoding, it is padded on the left with zeroes to 00010100011. The UTF-8 encoding becomes 11000010 10100011, where the x’s are replaced with the 11-bit binary encoding giving the UTF-8 character encoding 0xC2A3 for symbol £. Note, UTF-8 is a minimal encoding; e.g., it is incorrect to represent the value 0 by any encoding other than the first one. Use unformatted I/O (ifstream::read) to read the Unicode bytes and the public interface given in Figure 3 (you may only add a public constructor/destructor and private members). After creation, the coroutine is resumed with a series of bytes from a string (one byte at a time). The coroutine suspends after each byte or throws one of these two exception:
• Match means the bytes form a valid encoding and no more bytes can be sent
• Error means the bytes form an invalid encoding and no more bytes can be sent. Throw Error for an incorrectly encoded-byte or a correct encoding that falls outside the accepted range.
For example, the bytes 0xe09390 (11100000 1001 0011 1001 000) form a 3-byte UTF-8 character (known from the first byte). However, the character is invalid because its value 0x4d0 (xx010011 xx010000) < 0x800, the lower bound for a 3-byte UTF-8 character. This error can be detected at the second byte, but all three bytes must be read to allow finding the start of the next UTF-8 character for a sequence of UTF-8 characters or extra characters in this question.
After the coroutine throws an exception, it must terminate; sending more bytes to the coroutine after this point is undefined. Write a program utf8 that checks if a string follows the UTF-8 encoding. The shell interface to the utf8 program is as follows:
utf8 [ filename ]
(Square brackets indicate optional command line parameters, and do not appear on the actual command line.) If no input file name is specified, input comes from standard input. Output is sent to standard output. Issue appropriate runtime error messages for incorrect usage or if a file cannot be opened.