## Apache Nutch 1.0 released

Today, we received an announcement from Nutch committer Sami Siren that Apache Nutch 1.0 has been released. An extract from the announcement:

Apache Nutch, a subproject of Apache Lucene, is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats.

Apache Nutch 1.0 contains a number of bug fixes and improvements such as Solr Integration, new indexing framework and new scoring framework just to mention a few. Details can be found in the changes file:

I have been waiting for this release for a long time because I made some contributions to this project and wanted them to be available in an official release, so that I didn't have to maintain a separate set of patches for myself. These were also my first contributions to an open source project. Let me list my contributions from the CHANGES.txt file.

62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
server. (Susam Pal via dogacan)

77. NUTCH-44 - Too many search results, limits max results returned from a
single search. (Emilijan Mirceski and Susam Pal via kubes)

80. NUTCH-612 - URL filtering was disabled in Generator when invoked
from Crawl (Susam Pal via ab)

81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)


In 2007, while playing with the search engine, I found that there was no way for Nutch to authenticate itself to intranet sites requiring HTTP authentication. So, I modified the module that deals with the HTTP protocol so that it could authenticate itself with configured credentials when challenged with authentication. With this change, Nutch now supports NTLM, Basic and Digest authentication schemes. More details on this can be found in NUTCH-559 (JIRA) and the Nutch wiki entry on HTTP authentication schemes.
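As a rough illustration of the simplest of the three schemes, here is a minimal Python sketch of how a client builds the `Authorization` header for Basic authentication after being challenged. It is not the Nutch code (Nutch is written in Java); the function name and the credentials are made up for this example.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic 'Authorization' header."""
    # Basic authentication sends base64("username:password").
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# Hypothetical credentials, for illustration only.
header = basic_auth_header("crawler", "secret")
print("Authorization:", header)
```

NTLM and Digest involve a multi-step exchange with the server and are not covered by this sketch.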

NUTCH-44 and NUTCH-612 were bug fixes. NUTCH-601 involved the removal of a minor irritant. In the days of Nutch 0.9, the crawler complained if a directory with the name 'crawl' already existed in the current directory. As a result, before beginning a re-crawl using the bin/nutch crawl command, we had to move the existing crawl directory to another location. After a discussion in the community, we agreed that it was better to avoid shuffling the crawl directories by allowing re-crawls on the same directory. The change was made and committed.

The Nutch users' mailing list has often received mails from users who wanted to know how to enable support for authentication schemes in Nutch 0.9 by applying the patch in NUTCH-559. Patching Nutch 0.9 was a little cumbersome as the patch was generated against the trunk. With this release, users can simply download Nutch 1.0 and configure the authentication schemes.

With Firefox you can masquerade as your phone in a matter of minutes.

• Install User Agent Switcher.
• Open the built-in browser of your phone and visit: UserAgentString.com. This website will show the User-Agent string it receives from the HTTP User-Agent header sent by your browser.
• From your Firefox menu, select: Tools > User Agent Switcher > Options > Options > User Agent > Add
• Enter some description in the 'Description' field. In the 'User Agent' field enter the user-agent string carefully and save it.
• From the menu, select: Tools > User Agent Switcher > [The user agent you just added].

Now, you can browse sites like Google, Yahoo, etc. to get a feel of what these sites are going to look like on your cell phone. The web servers will respond to you in the same manner as they would respond to your cell phone.
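The same trick works outside the browser. The following is a minimal Python sketch, assuming a made-up User-Agent string (`PHONE_USER_AGENT` is a placeholder; use the string reported for your own phone). It only builds the request; the actual fetch is left as a comment.

```python
import urllib.request

# Hypothetical User-Agent string; replace it with the one your phone sends.
PHONE_USER_AGENT = "ExamplePhone/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1"


def build_request(url: str, user_agent: str = PHONE_USER_AGENT) -> urllib.request.Request:
    """Return a request that presents the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://mini.opera.com/beta")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
# To actually fetch the page: urllib.request.urlopen(request).read()
```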

I was trying to download Opera Mini 4 Beta for my cell phone. But sadly, I found that the website did not have a direct link to the MIDlet jar file for my phone. Instead, it asked me to visit http://mini.opera.com/beta with my cell phone and download the 97 KB file. To save my 1 rupee, I decided to masquerade as the Sony Ericsson K750i browser from my PC. I did it with netcat (the TCP/IP Swiss Army Knife) once. It worked but it was cumbersome. The method discussed above is an easier way to do it.

With the phone's user agent string set in Firefox, you get an XML file on visiting http://mini.opera.com/beta. Save it and find a link to a JAD file in the code. (For me, it was: /beta/mini.jad?rnd=1999705223&edition=hifi. So I could download it from http://mini.opera.com/beta/mini.jad?rnd=1999705223&edition=hifi.) Look for the 'MIDlet-Jar-URL' attribute in the JAD file. It is the URL to the JAR file for your phone. Download it and install it.
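A JAD file is a plain text file of `Name: value` lines, so picking out the JAR URL is easy to script. Here is a small Python sketch; the function name and the sample JAD content are made up for illustration, and a real JAD file carries more attributes.

```python
from typing import Optional


def find_jar_url(jad_text: str) -> Optional[str]:
    """Return the MIDlet-Jar-URL value from JAD text, or None if absent."""
    for line in jad_text.splitlines():
        # Split on the first colon only, so colons inside the URL survive.
        name, sep, value = line.partition(":")
        if sep and name.strip() == "MIDlet-Jar-URL":
            return value.strip()
    return None


# Made-up JAD content, for illustration only.
sample_jad = """MIDlet-Name: Example
MIDlet-Jar-Size: 12345
MIDlet-Jar-URL: http://example.com/example.jar
"""
print(find_jar_url(sample_jad))
```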

## Live Demo of Orkut Session Management Issue

On June 22, 2007, I posted a full disclosure regarding an Orkut session management issue. A week later, on June 29, 2007, I posted another full disclosure regarding a similar issue in Google. The cause of both issues is the same: the session associated with a user does not expire even after the user logs out, which is bad design from a security perspective.

A couple of days later, on the basis of these posts, a live demonstration of Orkut session hijacking was posted to the same mailing list. The results of the experiment were shared yesterday. They confirmed that an Orkut session remains alive for at least 7 days after the user has logged out.

The issue is not a critical one at the moment because it requires stealing the cookies. If a cross-site scripting (XSS) flaw is disclosed before this flaw is fixed, it can cause a great deal of mayhem because attackers can then use the XSS flaw to steal the session cookies. Once they have the session cookies, they can misuse the compromised account even after the user has logged out as a result of this issue.
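To make the expected behaviour concrete, here is a minimal Python sketch of a server-side session store that forgets a session on logout, so a stolen cookie becomes useless from that point on. This is only an illustration of the general idea; the class and method names are hypothetical and it says nothing about how Orkut is actually implemented.

```python
import secrets
from typing import Dict, Optional


class SessionStore:
    """Toy in-memory session store that invalidates sessions on logout."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def login(self, user: str) -> str:
        # Issue a random token; this is what would be sent as the cookie.
        token = secrets.token_hex(16)
        self._sessions[token] = user
        return token

    def user_for(self, token: str) -> Optional[str]:
        # Return the logged-in user for this token, or None if invalid.
        return self._sessions.get(token)

    def logout(self, token: str) -> None:
        # Forget the session on the server, not just in the browser.
        self._sessions.pop(token, None)


store = SessionStore()
cookie = store.login("alice")
stolen = cookie                # an attacker copies the cookie
store.logout(cookie)           # the real user logs out
print(store.user_for(stolen))  # the stolen cookie no longer maps to a user
```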

Let us hope this gets fixed soon.

Update: July 15, 2007: The live demonstration of Orkut session hijacking confirmed that an Orkut session remains alive for 14 days after the user has logged out.

## Character format interpretation bug in Notepad

I came across this bug when one of my friends forwarded a message to me describing it. Unfortunately the message missed the technical aspects of the bug. The reason for this bug is quite interesting.

I've verified the bug in Windows XP. Older versions might be affected as well. Here are the steps to reproduce the issue.

1. Open Notepad.
2. Type in this sentence exactly as given:
this app can break
3. Save the file.
4. Close Notepad.
5. Open the saved file by double clicking it.

Most users would find 9 boxes instead of that string. Some users would find Chinese characters.

A similar thing happens with other strings like:

1. Bush hid the facts
2. Bill hid the facts
3. aa aaa aaa
4. bb bbb bbb

There are many more. You can even craft such strings if you understand what is going on.

Let's take "this app can break" as an example and try to understand what's going on.

The hex codes for the characters in the string are:

74 68 69 73 20 61 70 70 20 63 61 6e 20 62 72 65 61 6b

Now let us assume that these 18 bytes do not represent characters encoded in ANSI or ASCII. Instead, let us assume they represent Unicode encoded characters and try to interpret the text again.

After re-arranging them to represent UTF-16LE encoded characters, we get this:

6874 7369 6120 7070 6320 6e61 6220 6572 6b61

Look up these codes in a Unicode chart to find out what characters they represent. Each code represents a CJK ideograph! CJK stands for Chinese, Japanese, and Korean.

So, the codes for those 18 ASCII encoded characters also happen to represent 9 valid CJK ideographs when encoded using Unicode.
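The re-interpretation can be checked with a few lines of Python. This sketch pairs up the ASCII bytes as little-endian 16-bit code units and tests whether every unit falls in the CJK Unified Ideographs block (U+4E00 to U+9FFF). The function names are made up, and the check only mirrors the reasoning above; it is not the detection logic Notepad itself uses.

```python
from typing import List


def utf16le_code_units(text: str) -> List[int]:
    """Pair up the ASCII bytes of text as little-endian 16-bit code units."""
    raw = text.encode("ascii")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def reads_as_cjk(text: str) -> bool:
    """True if the bytes form only CJK ideographs when read as UTF-16LE."""
    if len(text) % 2 != 0:
        return False  # an odd number of bytes cannot be paired up
    return all(0x4E00 <= unit <= 0x9FFF for unit in utf16le_code_units(text))


units = utf16le_code_units("this app can break")
print(" ".join(f"{unit:04x}" for unit in units))
# 6874 7369 6120 7070 6320 6e61 6220 6572 6b61

for candidate in ["this app can break", "Bush hid the facts", "aa aaa aaa", "1234"]:
    print(repr(candidate), reads_as_cjk(candidate))
```

A string like "1234" fails the check because its code units (3231 and 3433) fall outside the ideograph block.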

When Notepad opens a text file, it checks whether the byte stream represents Unicode characters. If it finds that they aren't Unicode characters, it interprets them as ASCII characters and displays the content of the file. In this particular case, Notepad finds the byte stream to be Unicode characters and hence displays them as Unicode characters.

If you find 9 boxes, it's because you don't have CJK fonts installed on your system and hence you can't see the CJK ideographs. Instead, Notepad displays them as boxes.

One of my colleagues asked me after playing a little with this bug, "When I create that file for the first time, I can see 9 boxes. But if I open the same file, delete everything, type the same thing again, close it and open it again, I don't see 9 boxes any more. I can read the text clearly. Does it mean that for some reason this time Notepad can interpret them as ASCII encoded characters?"

No! In fact it is just the opposite. This time Notepad correctly saves them as Unicode encoded characters.

First, Notepad saves the data in ASCII format. The next time it saves it as Unicode.

Let's see why.

After you create that file for the first time, get a dump of that file using the debug program of DOS.

C:\>debug a.txt
-r ip
IP 0100
:
-d 100 11f
0B66:0100  74 68 69 73 20 61 70 70-20 63 61 6E 20 62 72 65  this app can bre
0B66:0110  61 6B BC 00 72 16 03 D3-13 C8 E8 B3 34 00 55 0B  ak..r.......4.U.
-


Next time, when you open the file using Notepad and edit, Notepad considers the text to be composed of Unicode characters due to the reason explained earlier in this post. So, even if you write the same thing again and save, it saves the content in the form of Unicode characters. Let us check the debug dump to verify this.

C:\>debug a.txt
-r ip
IP 0100
:
-d 100 11f
0B66:0100  FF FE 74 00 68 00 69 00-73 00 20 00 61 00 70 00   ..t.h.i.s. .a.p.
0B66:0110  70 00 20 00 63 00 61 00-6E 00 20 00 62 00 72 00   p. .c.a.n. .b.r.
-


You can see that all characters have been saved in the Unicode format. Hence, the next time you open the file using Notepad, it correctly interprets them as Unicode characters and displays them accordingly.
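The first row of that dump can be reproduced in Python by writing a UTF-16LE byte order mark (FF FE) followed by the UTF-16LE encoding of the text. This is a sketch of the byte layout only, not of what Notepad does internally.

```python
import codecs

text = "this app can break"

# Byte order mark (FF FE) followed by two bytes per character.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

first_row = data[:16]
print(" ".join(f"{byte:02X}" for byte in first_row))
# FF FE 74 00 68 00 69 00 73 00 20 00 61 00 70 00

# Decoding with the generic "utf-16" codec consumes the BOM again.
print(data.decode("utf-16"))
```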

## : () { : | : & } ; : A cool hack for Linux geeks

Have a close look at this one-liner that can be executed. Beware! Don't execute it on a box without understanding the consequences completely.

: () { : | : & } ; :

Do you find this thing too geeky? We'll simplify it. It is deliberately obfuscated.

The : is a function name. It could very well have been f.

Let us replace : with f and see what it looks like.

f () { f | f & } ; f


Now it looks familiar. We have two commands separated by a semi-colon:

f()
{
    f | f &
}

f

So, with that one-liner, we create a function f and then execute it.

This function calls itself recursively. Soon, the box is full of many instances of this process.

The body of the function calls the function twice, once on each side of the pipe. So, we see an exponential growth in the number of processes. The & runs them in the background.

This one-liner is actually a fork bomb. On execution, the system very soon fills up with thousands of instances of that function, depleting CPU cycles, memory and the process table. The box is rendered useless. So, if you try it at all, try it only on your own box.
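The growth can be estimated safely without forking anything. The Python sketch below is a simplified model, assuming every instance starts exactly two new instances per generation and ignoring timing and any extra processes the shell creates; the function name is made up for this example.

```python
from typing import List


def instances_per_generation(generations: int) -> List[int]:
    """Count instances per generation when each one starts two more."""
    counts = [1]  # generation 0: the single call at the end of the one-liner
    for _ in range(generations):
        counts.append(counts[-1] * 2)
    return counts


counts = instances_per_generation(10)
print(counts)       # 1, 2, 4, ... doubling each generation
print(sum(counts))  # total instances started so far
```

After only 10 generations the model already has 1024 instances in the newest generation and 2047 in total, which is why the process table fills up so quickly.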