Friday, September 14, 2012

Install Apache Nutch 1.5 and Solr 3.6.0 on Ubuntu 12

1. Install the JDK
$ sudo apt-get install openjdk-7-jdk

2. Download and unpack Solr

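mkdir -p ~/tmp/solr
cd ~/tmp/solr
(download apache-solr-3.6.0.tgz into this directory)
tar -xzvf apache-solr-3.6.0.tgz

Solr ships with an embedded Jetty server. To start it, run java -jar start.jar from the apache-solr-3.6.0/example directory; press Ctrl-C to shut it down.

While Solr is running, check http://localhost:8983/solr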

3. Download and unpack Nutch

mkdir -p ~/tmp/nutch
cd ~/tmp/nutch
(download apache-nutch-1.5-bin.tar.gz into this directory)
tar -xzvf apache-nutch-1.5-bin.tar.gz

4. Configure Nutch (run these commands from the unpacked Nutch directory)
chmod +x bin/nutch
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
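The path above is for a 32-bit install. As a rough sketch, the directory name can be built from the machine's architecture instead of being hard-coded. This assumes the openjdk-7-jdk package uses its usual /usr/lib/jvm/java-7-openjdk-<arch> layout; java_home_for_arch is a hypothetical helper, not something shipped with Nutch or Ubuntu.

```shell
#!/bin/sh
# Sketch: build the OpenJDK 7 path for a given Debian/Ubuntu architecture.
# java_home_for_arch is a hypothetical helper name used for illustration.
java_home_for_arch() {
    # $1 is an architecture name such as i386 or amd64
    printf '/usr/lib/jvm/java-7-openjdk-%s\n' "$1"
}

# dpkg --print-architecture reports the architecture on Debian/Ubuntu;
# fall back to i386 (the value used above) if dpkg is unavailable.
arch=$(dpkg --print-architecture 2>/dev/null || echo i386)
JAVA_HOME=$(java_home_for_arch "$arch")
export JAVA_HOME
echo "JAVA_HOME=$JAVA_HOME"
```

If the printed directory does not exist, list /usr/lib/jvm to see what was actually installed.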
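Add the agent name property to conf/nutch-site.xml, inside the <configuration> element:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>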

mkdir -p urls
cd urls
touch seed.txt
nano seed.txt
add urls for crawling, for example
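The seed file can also be written without opening an editor. The sketch below assumes one URL per line in urls/seed.txt. Both write_seeds (a hypothetical helper) and http://example.com/ (a placeholder URL) are assumptions for illustration; substitute the sites you actually want to crawl.

```shell
#!/bin/sh
# write_seeds is a hypothetical helper: it writes each URL argument
# on its own line to <dir>/seed.txt, creating the directory if needed.
write_seeds() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# example.com is a placeholder, not a URL from the original post.
write_seeds urls http://example.com/
cat urls/seed.txt
```

Run it from the Nutch directory so that the urls directory ends up next to bin and conf.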

Return to the Nutch directory (cd ..), edit conf/regex-urlfilter.txt, and replace
# accept anything else
+.

with a regular expression matching the domain you wish to crawl. For example, to limit the crawl to a single domain, the line should read:
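A quick way to sanity-check such a rule before crawling is sketched below. It assumes the placeholder domain example.com. The helper url_allowed is hypothetical, and grep -E only approximates the Java regular expressions Nutch uses, so treat the result as a rough check rather than a guarantee.

```shell
#!/bin/sh
# Hypothetical rule for the placeholder domain example.com, in the
# "+regex" form used by conf/regex-urlfilter.txt.
rule='+^http://([a-z0-9]*\.)*example\.com/'

# url_allowed is a hypothetical helper: it strips the leading "+" and
# tests the URL with grep -E, which only approximates Java regex matching.
url_allowed() {
    pattern=${rule#+}
    printf '%s\n' "$1" | grep -Eq -- "$pattern"
}

for url in http://www.example.com/docs/ http://other.org/; do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Only URLs on example.com or one of its subdomains should be reported as accepted.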

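5. Configure Solr
In schema.xml (example/solr/conf/schema.xml in the Solr distribution), add the following field type inside <types> and the fields inside <fields>: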

<fieldType name="text" class="solr.TextField">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
</fieldType>

<field name="digest" type="text" stored="true" indexed="true"/>
<field name="boost" type="text" stored="true" indexed="true"/>
<field name="segment" type="text" stored="true" indexed="true"/>
<field name="host" type="text" stored="true" indexed="true"/>
<field name="site" type="text" stored="true" indexed="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="tstamp" type="text" stored="true" indexed="false"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="anchor" type="text" stored="true" indexed="false" multiValued="true"/>

change <uniqueKey>id</uniqueKey> to

In solrconfig.xml, add
<requestHandler name="/nutch" class="solr.SearchHandler" >
    <lst name="defaults">
       <str name="defType">dismax</str>
       <str name="echoParams">explicit</str>
       <float name="tie">0.01</float>
       <str name="qf">
         content^0.5 anchor^1.0 title^1.2
       </str>
       <str name="pf">
         content^0.5 anchor^1.5 title^1.2 site^1.5
       </str>
       <str name="fl">
       </str>
       <int name="ps">100</int>
       <bool name="hl">true</bool>
       <str name="q.alt">*:*</str>
       <str name="hl.fl">title url content</str>
       <str name="f.title.hl.fragsize">0</str>
       <str name="f.title.hl.alternateField">title</str>
       <str name="f.url.hl.fragsize">0</str>
       <str name="f.url.hl.alternateField">url</str>
       <str name="f.content.hl.fragmenter">regex</str>
    </lst>
</requestHandler>

6. Run the Nutch crawler and index into Solr (make sure Solr is running)

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
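Check the indexed documents at http://localhost:8983/solr
The following error in the Nutch console output can be ignored. SolrIndexer finishes before SolrDeleteDuplicates fails, so the documents have already been sent to Solr: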
SolrIndexer: starting at 2012-09-14 10:37:49
Indexing 11 documents
SolrIndexer: finished at 2012-09-14 10:38:36, elapsed: 00:00:46
SolrDeleteDuplicates: starting at 2012-09-14 10:38:36
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
Exception in thread "main" Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(
    at org.apache.nutch.crawl.Crawl.main(
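
Before ignoring the exception, it is worth confirming that the indexing step reported a document count. The sketch below pulls that number out of saved console output. indexed_count is a hypothetical helper, and it assumes the "Indexing N documents" line appears exactly as in the output above.

```shell
#!/bin/sh
# indexed_count is a hypothetical helper: it reads crawl console output on
# stdin and prints the number from each "Indexing N documents" line.
indexed_count() {
    sed -n 's/^Indexing \([0-9][0-9]*\) documents$/\1/p'
}

# Sample input copied from the console output above.
indexed_count <<'EOF'
SolrIndexer: starting at 2012-09-14 10:37:49
Indexing 11 documents
SolrIndexer: finished at 2012-09-14 10:38:36, elapsed: 00:00:46
SolrDeleteDuplicates: starting at 2012-09-14 10:38:36
Exception in thread "main" Job failed!
EOF
```

If nothing is printed for your own crawl output, the indexing step did not run and the error should not be ignored.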


