Tuesday, September 25, 2012

Nutch Gora HBase Solr

1. download Nutch 2.0
2. download HBase version 0.90.6; Nutch 2.0 does not work with higher versions.
 follow the installation instructions at http://hbase.apache.org/book/quickstart.html
note: if HBase fails to start with a proxy/interface error, map the hostname to the machine's real IP address instead of localhost/127.0.0.1 in /etc/hosts (sketched below); if running in a virtual machine, use the virtual host's IP address.
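A minimal /etc/hosts sketch (192.168.1.10 and myhost are placeholders; substitute your machine's real IP address and hostname):

127.0.0.1      localhost
192.168.1.10   myhost    # was: 127.0.1.1 myhost, which HBase resolves incorrectly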
3. follow tutorials at http://wiki.apache.org/nutch/Nutch2Tutorial
nutch-site.xml


<property>
 <name>http.agent.name</name>
 <value>Spider</value>
</property>
<property>
 <name>http.robots.agents</name>
 <value>Spider,*</value>
</property>

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

ivy/ivy.xml

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

gora.properties


gora.datastore.default=org.apache.gora.hbase.store.HBaseStore


4. build with ant: run ant runtime
    the runnable files are generated under runtime/local
5. follow the instructions for Nutch 1.5 and Solr (see the post below) for crawling and indexing

bin/nutch crawl urls -depth 3 -topN 5 (this stores results in HBase but does not index into Solr; the crawl can also be run step by step, as sketched below)
to index into Solr: bin/nutch solrindex http://localhost:8983/solr -reindex
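The one-shot crawl broken into individual steps (a sketch assuming Nutch 2.0's batch-based commands; exact flags vary between 2.x releases, so check bin/nutch <command> -help):

bin/nutch inject urls            # seed the webpage table from urls/
bin/nutch generate -topN 5       # select URLs to fetch; prints a batch ID
bin/nutch fetch <batchId>        # fetch the generated batch
bin/nutch parse <batchId>        # parse the fetched pages
bin/nutch updatedb               # write newly discovered links back to the webpage table
bin/nutch solrindex http://localhost:8983/solr -reindex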




6. after running nutch, all crawl data is stored in the HBase table "webpage"
   bin/hbase shell
   hbase(main):002:0> scan "webpage"
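Other useful checks from the HBase shell (assuming Nutch's default reversed-URL row keys, e.g. org.apache.nutch:http/ for http://nutch.apache.org/):

hbase(main):003:0> count "webpage"
hbase(main):004:0> get "webpage", "org.apache.nutch:http/"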




Friday, September 21, 2012

Eclipse installation on Ubuntu

1. download and extract the Eclipse archive.
2.
sudo mv eclipse /opt/
cd /opt
sudo chown -R root:root eclipse
sudo chmod -R +r eclipse

3. Create an eclipse executable in your path

sudo touch /usr/bin/eclipse
sudo chmod 755 /usr/bin/eclipse
sudo nano /usr/bin/eclipse
copy this into nano
#!/bin/sh
#export MOZILLA_FIVE_HOME="/usr/lib/mozilla/"
export ECLIPSE_HOME="/opt/eclipse"

$ECLIPSE_HOME/eclipse "$@"
4. Create a gnome menu item
sudo nano /usr/share/applications/eclipse.desktop
copy this into nano
[Desktop Entry]
Encoding=UTF-8
Name=Eclipse
Comment=Eclipse IDE
Exec=eclipse
Icon=/opt/eclipse/icon.xpm
Terminal=false
Type=Application
Categories=GNOME;Application;Development;
StartupNotify=true
save and exit nano
5. Launch Eclipse for the first time
/opt/eclipse/eclipse -clean &
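To confirm the launcher is set up correctly:

which eclipse    # should print /usr/bin/eclipse
eclipse &        # starts Eclipse through the wrapper script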

Thursday, September 20, 2012

Solr Query

The default query matching all documents is *:*. A colon separates a field name from its value, e.g.

keywords:web

Using AND (&&) and OR

content:OWL OR title:"Semantic Web"
content:OWL && title:"Semantic Web"
content:OWL AND title:"Semantic Web"

(multi-word values must be quoted; otherwise only the first word is matched against the named field and the rest are searched in the default field)

Nested Queries

using _query_

content:"semantic" AND _query_:"alcategory:computer_internet" AND _query_:"title:web" AND _query_:"keywords:data"

source: http://searchhub.org/dev/2009/03/31/nested-queries-in-solr/

Wednesday, September 19, 2012

Crawling PDF using Apache Nutch

Nutch parses PDF documents with the parse-tika plugin (Apache Tika, which uses PDFBox for PDFs). The plugin is located in /plugins/parse-tika

in nutch-site.xml

add:
<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>


testing using url: http://www.master.netseven.it/files/262-Nutch.pdf

note:
PDF files converted from Microsoft Word may be parsed into unreadable symbols. Example URL: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf
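To test how a single PDF parses without running a full crawl, recent Nutch 1.x builds ship a parsechecker tool (run bin/nutch with no arguments to see whether yours includes it):

bin/nutch parsechecker http://www.master.netseven.it/files/262-Nutch.pdf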

Delete Documents from Solr index


To delete all documents from the Solr index, enter the following in a browser:

http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/update?stream.body=<commit/>
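The same deletion from the command line, posting to the same update handler (commit=true folds in the commit step):

curl "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary '<delete><query>*:*</query></delete>'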

Friday, September 14, 2012

Nutch, Solr, UIMA integration




The UIMA-generated fields may not be added to indexed documents; in that case, make the following changes in solrconfig.xml:

<updateRequestProcessorChain name="uima" default="true">

<lst name="analyzeFields">
  <bool name="merge">false</bool>
  <arr name="fields">
    <str>content</str>
  </arr>
</lst>
See this link for the UIMA AlchemyAPI annotator: http://uima.apache.org/d/uima-addons-current/AlchemyAPIAnnotator/AlchemyAPIAnnotatorUserGuide.html

Install Apache Nutch 1.5 and Solr 3.6.0 on Ubuntu 12



1. install the JDK
sudo apt-get install openjdk-7-jdk

2. Download and unpack Solr

mkdir -p ~/tmp/solr
cd ~/tmp/solr
wget http://mirror.lividpenguin.com/pub/apache/lucene/solr/3.6.0/apache-solr-3.6.0.tgz
tar -xzvf apache-solr-3.6.0.tgz
Solr ships with an embedded Jetty server; start it with java -jar start.jar from apache-solr-3.6.0/example (shut down with Ctrl-C), then

check http://localhost:8983/solr

3. Download and unpack Nutch

mkdir -p ~/tmp/nutch
cd ~/tmp/nutch
wget  http://mirror.rmg.io/apache/nutch/1.5/apache-nutch-1.5-bin.tar.gz
tar -xzvf apache-nutch-1.5-bin.tar.gz

4. configure Nutch
cd into the extracted Nutch directory, then:
chmod +x bin/nutch
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
add to conf/nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>


mkdir -p urls
cd urls
touch seed.txt
nano seed.txt
add URLs to crawl, one per line, for example:

http://nutch.apache.org/


edit conf/regex-urlfilter.txt and replace

# accept anything else
+.

with a regular expression matching the domain you wish to crawl. For example, to limit the crawl to the nutch.apache.org domain, the line should read:
 +^http://([a-z0-9]*\.)*nutch.apache.org/

5. configure Solr
the config files are in ~/tmp/solr/apache-solr-3.6.0/example/solr/conf
in schema.xml, add the following field type and fields:

<fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>


<field name="digest" type="text" stored="true" indexed="true"/>
<field name="boost" type="text" stored="true" indexed="true"/>
<field name="segment" type="text" stored="true" indexed="true"/>
<field name="host" type="text" stored="true" indexed="true"/>
<field name="site" type="text" stored="true" indexed="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="tstamp" type="text" stored="true" indexed="false"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="anchor" type="text" stored="true" indexed="false" multiValued="true"/>

change <uniqueKey>id</uniqueKey> to
<uniqueKey>url</uniqueKey>

in solrconfig.xml add
<requestHandler name="/nutch" class="solr.SearchHandler" >
    <lst name="defaults">
       <str name="defType">dismax</str>
       <str name="echoParams">explicit</str>
       <float name="tie">0.01</float>
       <str name="qf">
         content^0.5 anchor^1.0 title^1.2
       </str>
       <str name="pf">
         content^0.5 anchor^1.5 title^1.2 site^1.5
       </str>
       <str name="fl">
       url
       </str>
       <int name="ps">100</int>
       <bool name="hl">true</bool>
       <str name="q.alt">*:*</str>
<str name="hl.fl">title url content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
</lst>
</requestHandler>
   
6. run Nutch crawler and index in Solr (make sure Solr has started)

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
 
check the indexed documents at http://localhost:8983/solr
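The /nutch request handler configured in step 5 can also be queried directly (assuming the example Jetty instance is still running):

curl "http://localhost:8983/solr/nutch?q=nutch&wt=json"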
 
ignore the SolrDeleteDuplicates error at the end of the Nutch console output; the documents have already been indexed:
SolrIndexer: starting at 2012-09-14 10:37:49
Indexing 11 documents
SolrIndexer: finished at 2012-09-14 10:38:36, elapsed: 00:00:46
SolrDeleteDuplicates: starting at 2012-09-14 10:38:36
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)