Apache Nutch, a subproject of Apache Lucene, is open source web-search software. It builds on Lucene Java, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
Apache Nutch 1.0 contains a number of bug fixes and improvements, such as Solr integration, a new indexing framework and a new scoring framework, to mention just a few. Details can be found in the CHANGES.txt file.
Apache Nutch is available for download from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/
I have been waiting for this release for a long time, as I made some contributions to this project and wanted them to be available in an official release so that I would not have to maintain a separate set of patches for myself. These were also my first contributions to an open source project. Let me list them as they appear in the CHANGES.txt file.
62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy server. (Susam Pal via dogacan)
77. NUTCH-44 - Too many search results, limits max results returned from a single search. (Emilijan Mirceski and Susam Pal via kubes)
80. NUTCH-612 - URL filtering was disabled in Generator when invoked from Crawl (Susam Pal via ab)
81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
In 2007, while playing with the search engine, I found that there was no way for Nutch to authenticate itself to intranet sites requiring HTTP authentication. So, I modified the module that deals with the HTTP protocol so that it could authenticate itself with configured credentials when challenged with authentication. With this change, Nutch now supports NTLM, Basic and Digest authentication schemes. More details on this can be found in NUTCH-559 (JIRA) and the Nutch wiki entry on HTTP authentication schemes.
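To make the challenge-and-response idea concrete, here is a minimal sketch of how a client might answer a Basic authentication challenge. This is not the code from NUTCH-559; the class and method names are hypothetical, and it covers only the Basic scheme, since Digest and NTLM involve more elaborate handshakes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;
import java.util.Optional;

// Hypothetical illustration only; not part of Nutch.
public class BasicAuthSketch {

    // Builds the Authorization header value for the Basic scheme:
    // "Basic " followed by base64("username:password").
    public static String basicHeader(String username, String password) {
        String pair = username + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    // Returns a header value only when the server answered 401 and its
    // WWW-Authenticate header asks for the Basic scheme; otherwise empty.
    public static Optional<String> answerChallenge(int status,
                                                   String wwwAuthenticate,
                                                   String username,
                                                   String password) {
        if (status != 401 || wwwAuthenticate == null) {
            return Optional.empty();
        }
        // The scheme name is the first token and is case-insensitive.
        String scheme = wwwAuthenticate.trim()
                .split("\\s+", 2)[0]
                .toLowerCase(Locale.ROOT);
        if (!scheme.equals("basic")) {
            return Optional.empty();
        }
        return Optional.of(basicHeader(username, password));
    }

    public static void main(String[] args) {
        // Assumed example values: a 401 response with a Basic challenge.
        Optional<String> header =
                answerChallenge(401, "Basic realm=\"intranet\"", "user", "pass");
        System.out.println(header.orElse("(no credentials sent)"));
    }
}
```

The point of the sketch is that the credentials are sent only after the server asks for them, which is the behaviour described above: the crawler fetches as usual and authenticates itself only when challenged.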
Two of the other contributions were bug fixes. NUTCH-601 involved the removal of a minor irritant. In the days of Nutch 0.9, the crawler complained if a directory with the name 'crawl' already existed in the current directory. As a result, before beginning a re-crawl using the bin/nutch crawl command, we had to move the existing crawl directory to another location. After a discussion in the community, we agreed that it was better to avoid shuffling the crawl directories by allowing re-crawls on the same directory. The change was made and committed.
The Nutch users' mailing list has often received emails from users who wanted to know how to enable support for authentication schemes in Nutch 0.9 by applying the patch in NUTCH-559. Patching Nutch 0.9 was a little cumbersome because the patch was generated against the trunk. With this release, users can simply download Nutch 1.0 and configure the authentication schemes.