Today, we received an announcement from the Nutch committer, Sami Siren that Apache Nutch 1.0 has been released. An extract from the announcement:

Apache Nutch, a subproject of Apache Lucene, is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats.

Apache Nutch 1.0 contains a number of bug fixes and improvements such as Solr Integration, new indexing framework and new scoring framework just to mention a few. Details can be found in the changes file:

Apache Nutch is available for download from the following download page:

I have been waiting for this release for a long time as I made some contributions to this project and I wanted them to be available in official release so that I didn't have to maintain a separate set of patches for myself. These contributions were also my first contributions to an open source project. Let me list my contributions from the CHANGES.txt file.

62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
    server. (Susam Pal via dogacan)

77. NUTCH-44 - Too many search results, limits max results returned from a 
    single search. (Emilijan Mirceski and Susam Pal via kubes)

80. NUTCH-612 - URL filtering was disabled in Generator when invoked
    from Crawl (Susam Pal via ab)

81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)

In 2007, while playing with the search engine, I found that there was no way for Nutch to authenticate itself to intranet sites requiring HTTP authentication. So, I modified the module that deals with the HTTP protocol so that it could authenticate itself with configured credentials when challenged with authentication. With this change, Nutch now supports NTLM, Basic and Digest authentication schemes. More details on this can be found in NUTCH-559 (JIRA) and the Nutch wiki entry on HTTP authentication schemes.

NUTCH-44 and NUTCH-612 were bug fixes. NUTCH-601 involved the removal of a minor irritant. In the days of Nutch 0.9, the crawler complained if a directory with the name 'crawl' already existed in the current directory. As a result, before beginning a re-crawl using the bin/nutch crawl command, we had to move the existing crawl directory to another location. After a discussion in the community, we agreed that it was better to avoid shuffling the crawl directories by allowing re-crawls on the same directory. The change was made and committed.

Nutch users' mailing list has often received mails from users who wanted to know how they can enable support for authentication schemes in Nutch 0.9 by applying the patch in NUTCH-559. Patching Nutch 0.9 was a little cumbersome as the patch was generated against the trunk. With this release, the users can simply download Nutch 1.0 and configure the authentication schemes.


Life Mysterious said:

Congrats and thanx!

I'm sure it would have been a long wait for a lot of users for configuring their authentication schemes.

Paritosh said:

Hey Dude,

Finally caught up with your blog. Nice posts, quite informative. Have started following your blog.


Utkarshraj Atmaram said:

Congrats! Good to see your work being useful to the open source community.

Post a comment