Character Encoding Bug in Notepad

By Susam Pal on 19 Jun 2006

Symptoms of the Bug

I came across an interesting bug in Notepad last week. I have verified that this bug is reproducible in Windows XP. Older versions might be affected as well. Here are the steps to reproduce the issue:

  1. Open Notepad.
  2. Enter the following text exactly as shown here:
    this app can break
  3. Save the file.
  4. Close Notepad.
  5. Open the file again with Notepad.

Some users may find Chinese characters instead of the English text that was entered. Others may find 9 boxes instead.

A similar issue happens with other strings like the following ones:

Bush hid the facts
Bill hid the facts
aa aaa aaa
bb bbb bbb

Once we understand what causes this bug, we can craft many more strings that trigger it.

Cause of the Bug

Let us take the following text as an example and try to understand what is going on:

this app can break

Here are the hexadecimal codes for the characters in the string:

74 68 69 73 20 61 70 70 20 63 61 6e 20 62 72 65 61 6b

Now let us interpret these 18 bytes as if they were UTF-16LE encoded text. UTF-16LE stores each 16-bit character with its low-order byte first, so we take the bytes in pairs and swap the two bytes of each pair. This gives us 9 UTF-16LE encoded characters with the following code points:

6874 7369 6120 7070 6320 6e61 6220 6572 6b61

Each of the codes above represents a CJK ideograph. CJK stands for Chinese, Japanese, and Korean.
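
The regrouping of bytes can be checked with a short Python sketch. Python is not part of the original investigation; it is used here only as a convenient way to reinterpret the same 18 bytes, and the variable names are illustrative.

```python
# Reinterpret the ASCII bytes of the text as UTF-16LE code units.
text = "this app can break"
raw = text.encode("ascii")  # the 18 bytes listed above

# UTF-16LE stores the low-order byte first, so each byte pair (b0, b1)
# becomes the 16-bit value b0 + b1 * 256.
code_units = [raw[i] | (raw[i + 1] << 8) for i in range(0, len(raw), 2)]
print(" ".join(f"{u:04x}" for u in code_units))
# 6874 7369 6120 7070 6320 6e61 6220 6572 6b61

# The built-in codec performs the same reinterpretation.
misread = raw.decode("utf-16-le")
print(len(misread))  # 9
```

Printing misread on a system with CJK fonts should show the same ideographs that Notepad displays.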

We can see now that the 18 bytes entered into Notepad also happen to represent 9 valid CJK ideographs when interpreted as UTF-16LE. When Notepad opens a text file, it finds that the bytes in the file happen to form valid UTF-16LE characters, so it displays them as UTF-16LE characters. Those who do not have CJK fonts installed on their systems see boxes instead.
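
To look for other strings with the same property, we can test whether every byte pair of an ASCII string lands in the CJK Unified Ideographs block (U+4E00 to U+9FFF). The following is a minimal sketch of that test only. It is not Notepad's detection logic, which this post does not examine, and the function name looks_like_cjk_utf16le is made up for this illustration.

```python
def looks_like_cjk_utf16le(text: str) -> bool:
    """Return True if the ASCII bytes of text, read as UTF-16LE,
    consist entirely of CJK Unified Ideographs (U+4E00 to U+9FFF)."""
    raw = text.encode("ascii")
    if len(raw) == 0 or len(raw) % 2 != 0:
        return False  # an odd byte count cannot be split into 16-bit units
    units = (raw[i] | (raw[i + 1] << 8) for i in range(0, len(raw), 2))
    return all(0x4E00 <= u <= 0x9FFF for u in units)


# The first four strings come from this post; the last one is a counterexample.
for s in ["this app can break", "Bush hid the facts", "aa aaa aaa",
          "bb bbb bbb", "Hello world!"]:
    print(f"{s!r}: {looks_like_cjk_utf16le(s)}")
```

Passing this test shows only that a string shares the property described above. Whether a given string actually triggers the bug still has to be confirmed in Notepad.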

Rewriting the File

One of my friends, after playing a little with this bug, asked me, "When I create that file for the first time, I see 9 boxes. But if I open the same file, delete everything, type the same thing again, close it, and open it again, I don't see 9 boxes any more. I can read the English text without issues now. Does it mean that for some reason this time Notepad can interpret them as ASCII encoded characters?"

The answer is: No! In fact it is just the opposite. This time Notepad correctly saves them as UTF-16LE encoded characters.

The first time, Notepad saves the data in ASCII encoding. The second time, it saves the data in UTF-16LE encoding.

Let us create the file for the first time and see what each byte looks like using the debug program of DOS.

C:\>debug foo.txt
-r ip
IP 0100
:
-d 100 11f
0B66:0100  74 68 69 73 20 61 70 70-20 63 61 6E 20 62 72 65  this app can bre
0B66:0110  61 6B BC 00 72 16 03 D3-13 C8 E8 B3 34 00 55 0B  ak..r.......4.U.
-

When we open this file in Notepad, Notepad considers the text to be in UTF-16LE encoding for the reasons explained earlier in this post. Therefore it displays the text as CJK ideographs, or as boxes if CJK fonts are missing. Now when we erase the text and type the same English text again, the English text is saved in UTF-16LE encoding (not in ASCII encoding like the first time). This can be confirmed with the debug command.

C:\>debug a.txt
-r ip
IP 0100
:
-d 100 11f
0B66:0100  FF FE 74 00 68 00 69 00-73 00 20 00 61 00 70 00  ..t.h.i.s. .a.p.
0B66:0110  70 00 20 00 63 00 61 00-6E 00 20 00 62 00 72 00  p. .c.a.n. .b.r.
-

The two bytes FF and FE at the beginning are the byte order mark (BOM) for UTF-16LE encoding. The remaining bytes are the characters of the text in UTF-16LE encoding, where each ASCII character is followed by a 00 byte.
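
The layout of the second dump can be reproduced with a few lines of Python. This is a sketch that assumes the rewritten file consists of nothing more than a BOM followed by the UTF-16LE code units of the text, which is what the dump shows for its first 32 bytes.

```python
import codecs

text = "this app can break"

# Assumed layout of the rewritten file: BOM, then UTF-16LE code units.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(data[:16].hex(" "))
# ff fe 74 00 68 00 69 00 73 00 20 00 61 00 70 00
print(len(data))  # 38

# The generic "utf-16" codec uses the BOM to pick the byte order and strips it.
print(data.decode("utf-16"))  # this app can break
```

Because the file now starts with a BOM, a reader no longer has to guess the encoding from the content alone.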