## Character format interpretation bug in Notepad

I came across this bug when one of my friends forwarded a message to me describing it. Unfortunately the message missed the technical aspects of the bug. The reason for this bug is quite interesting.

I've verified the bug in Windows XP. Older versions might be affected as well. Here are the steps to reproduce the issue.

1. Open Notepad.
2. Type in this sentence exactly as given:
   this app can break
3. Save the file.
4. Close Notepad.
5. Open the saved file by double-clicking it.

Most users would find 9 boxes instead of that string. Some users would find Chinese characters.

A similar thing happens with other strings, such as:

1. Bush hid the facts
2. Bill hid the facts
3. aa aaa aaa
4. bb bbb bbb

There are many more. You can even craft such strings if you understand what is going on.

Let's take "this app can break" as an example and try to understand what's going on.

The hex codes for the characters in the string are:

74 68 69 73 20 61 70 70 20 63 61 6e 20 62 72 65 61 6b

Now let us assume that these 18 bytes do not represent characters encoded in ANSI or ASCII. Instead, let us assume they represent Unicode-encoded characters and interpret the text again.

After rearranging them into 16-bit little-endian code units (UTF-16LE), where the second byte of each pair is the high-order byte, we get this:

6874 7369 6120 7070 6320 6e61 6220 6572 6b61

Each code represents a CJK ideograph. CJK stands for Chinese, Japanese, and Korean.

So, the bytes for those 18 ASCII-encoded characters also happen to represent 9 valid CJK ideographs when read as UTF-16LE.
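The reinterpretation can be reproduced with a short Python sketch. The variable names are only illustrative; the script pairs up the 18 ASCII bytes as little-endian 16-bit units and then decodes the same bytes as UTF-16LE.

```python
# Reinterpret the ASCII bytes of the sentence as UTF-16LE code units.
raw = "this app can break".encode("ascii")

# The 18 bytes as they sit in the file.
print(" ".join(f"{b:02x}" for b in raw))

# Pair the bytes up; in little-endian order the second byte of each
# pair is the high-order byte of the 16-bit unit.
units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]
print(" ".join(f"{u:04x}" for u in units))

# Decoding the very same bytes as UTF-16LE yields 9 characters.
as_utf16 = raw.decode("utf-16-le")
print(len(as_utf16), as_utf16)
```

Whether the last line shows ideographs or boxes depends on the fonts available to your terminal.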

When Notepad opens a text file, it checks whether the byte stream looks like Unicode text. If it does not, Notepad interprets the bytes as ASCII characters and displays the content of the file. In this particular case, Notepad decides that the byte stream is Unicode text and hence displays the bytes as Unicode characters.

If you find 9 boxes, it's because you don't have CJK fonts installed on your system and hence you can't see the CJK ideographs. Instead, Notepad displays them as boxes.
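To experiment with crafting such strings, here is a rough Python sketch. It is not Notepad's actual detection logic, which this post does not spell out. The hypothetical helper `reads_as_cjk` only tests the condition described above: an even number of bytes, every pair of which decodes to a code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF). A string that passes is a candidate, not a guaranteed trigger.

```python
# CJK Unified Ideographs block.
CJK_FIRST, CJK_LAST = 0x4E00, 0x9FFF


def reads_as_cjk(text: str) -> bool:
    """Return True if the ASCII bytes of text, read as UTF-16LE,
    consist only of CJK Unified Ideographs (illustrative check only)."""
    raw = text.encode("ascii")
    if len(raw) % 2 != 0:
        # An odd byte count cannot be split into 16-bit units.
        return False
    decoded = raw.decode("utf-16-le")
    return all(CJK_FIRST <= ord(ch) <= CJK_LAST for ch in decoded)


for candidate in ["this app can break", "Bush hid the facts",
                  "aa aaa aaa", "hello", "a b "]:
    print(repr(candidate), reads_as_cjk(candidate))
```

All the strings listed in this post pass the check, while "hello" (odd length) and "a b " (its pairs decode to code points outside the block) do not.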

One of my colleagues asked me after playing a little with this bug, "When I create that file for the first time, I can see 9 boxes. But if I open the same file, delete everything, type the same thing again, close it and open it again, I don't see 9 boxes any more. I can read the text clearly. Does it mean that for some reason this time Notepad can interpret them as ASCII encoded characters?"

No! In fact, it is just the opposite. This time Notepad correctly saves them as Unicode-encoded characters.

The first time, Notepad saves the data in ASCII format. The next time, it saves the data as Unicode.

Let's see why.

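After you create that file for the first time, get a dump of it using the DOS debug program. Only the first 18 bytes of the dump belong to the file; the bytes after 61 6B are whatever happened to be in memory.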

    C:\>debug a.txt
    -r ip
    IP 0100
    :
    -d 100 11f
    0B66:0100  74 68 69 73 20 61 70 70-20 63 61 6E 20 62 72 65  this app can bre
    0B66:0110  61 6B BC 00 72 16 03 D3-13 C8 E8 B3 34 00 55 0B  ak..r.......4.U.
    -


The next time you open the file in Notepad and edit it, Notepad considers the text to be composed of Unicode characters for the reason explained earlier in this post. So, even if you type the same thing again and save, it saves the content as Unicode characters. Let us check the debug dump to verify this.

    C:\>debug a.txt
    -r ip
    IP 0100
    :
    -d 100 11f
    0B66:0100  FF FE 74 00 68 00 69 00-73 00 20 00 61 00 70 00  ..t.h.i.s. .a.p.
    0B66:0110  70 00 20 00 63 00 61 00-6E 00 20 00 62 00 72 00  p. .c.a.n. .b.r.
    -


You can see that all characters have been saved in the Unicode format: the file now begins with the byte order mark FF FE, and each character occupies two bytes. Hence, the next time you open the file using Notepad, it correctly interprets them as Unicode characters and displays them accordingly.
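The two on-disk forms can be compared without the debug program. The following Python sketch builds both byte sequences in memory; it assumes the second save is UTF-16LE preceded by the byte order mark, which is what the dump above shows. `hex_row` is a hypothetical helper for printing 16 bytes at a time.

```python
import codecs

text = "this app can break"

# First save: one byte per character.
first_save = text.encode("ascii")

# Second save: byte order mark FF FE followed by two bytes per character.
second_save = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def hex_row(data: bytes, start: int) -> str:
    """Format up to 16 bytes of data, beginning at offset start, as hex."""
    return " ".join(f"{b:02X}" for b in data[start:start + 16])


print(len(first_save), hex_row(first_save, 0))
print(len(second_save), hex_row(second_save, 0))

# With the byte order mark present, the bytes decode back to the
# original sentence unambiguously.
print(second_save.decode("utf-16"))
```

The second row begins with `FF FE 74 00 68 00`, matching the dump of the re-saved file.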

#### Anonymous said:

Susam,

It's a very good interpretation. I saw this thing many times but never understood it.

#### Venkat said:

Nice and detailed explanation of the Notepad bug.

#### Abhishek Pathak said:

Great explanation!