Tuesday, September 25, 2012

Nutch Gora HBase Solr

1. download Nutch 2.0
2. download HBase version 0.90.6; Nutch 2.0 does not work with higher versions.
 follow the installation instructions at http://hbase.apache.org/book/quickstart.html
note: if HBase fails to start with a proxy/interface error, map the hostname to the machine's real IP address instead of localhost/127.0.0.1 in /etc/hosts (sketched below); if running in a virtual machine, use the virtual host's IP address.
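A minimal /etc/hosts sketch (192.168.1.10 and myhost are placeholders; substitute your machine's real IP address and hostname):

127.0.0.1      localhost
192.168.1.10   myhost    # was: 127.0.1.1 myhost, which HBase resolves incorrectly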
3. follow tutorials at http://wiki.apache.org/nutch/Nutch2Tutorial
nutch-site.xml


<property>
 <name>http.agent.name</name>
 <value>Spider</value>
</property>
<property>
 <name>http.robots.agents</name>
 <value>Spider,*</value>
</property>

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

ivy/ivy.xml

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

gora.properties


gora.datastore.default=org.apache.gora.hbase.store.HBaseStore


4. build with ant: run ant runtime
    the runnable files are generated under runtime/local
5. follow the instructions for Nutch 1.5 and Solr (see the post below) for crawling and indexing

bin/nutch crawl urls -depth 3 -topN 5 (this stores results in HBase but does not index into Solr; the crawl can also be run step by step, as sketched below)
to index into Solr: bin/nutch solrindex http://localhost:8983/solr -reindex
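The one-shot crawl broken into individual steps (a sketch assuming Nutch 2.0's batch-based commands; exact flags vary between 2.x releases, so check bin/nutch <command> -help):

bin/nutch inject urls            # seed the webpage table from urls/
bin/nutch generate -topN 5       # select URLs to fetch; prints a batch ID
bin/nutch fetch <batchId>        # fetch the generated batch
bin/nutch parse <batchId>        # parse the fetched pages
bin/nutch updatedb               # write newly discovered links back to the webpage table
bin/nutch solrindex http://localhost:8983/solr -reindex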




6. after running nutch, all crawl data is stored in the HBase table "webpage"
   bin/hbase shell
   hbase(main):002:0> scan "webpage"
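Other useful checks from the HBase shell (assuming Nutch's default reversed-URL row keys, e.g. org.apache.nutch:http/ for http://nutch.apache.org/):

hbase(main):003:0> count "webpage"
hbase(main):004:0> get "webpage", "org.apache.nutch:http/"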




Friday, September 21, 2012

Eclipse installation on Ubuntu

1. download and extract the Eclipse archive.
2.
sudo mv eclipse /opt/
cd /opt
sudo chown -R root:root eclipse
sudo chmod -R +r eclipse

3. Create an eclipse executable in your path

sudo touch /usr/bin/eclipse
sudo chmod 755 /usr/bin/eclipse
sudo nano /usr/bin/eclipse
copy this into nano
#!/bin/sh
#export MOZILLA_FIVE_HOME="/usr/lib/mozilla/"
export ECLIPSE_HOME="/opt/eclipse"

$ECLIPSE_HOME/eclipse "$@"
4. Create a gnome menu item
sudo nano /usr/share/applications/eclipse.desktop
copy this into nano
[Desktop Entry]
Encoding=UTF-8
Name=Eclipse
Comment=Eclipse IDE
Exec=eclipse
Icon=/opt/eclipse/icon.xpm
Terminal=false
Type=Application
Categories=GNOME;Application;Development;
StartupNotify=true
save and exit nano
5. Launch Eclipse for the first time
/opt/eclipse/eclipse -clean &
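To confirm the launcher is set up correctly:

which eclipse    # should print /usr/bin/eclipse
eclipse &        # starts Eclipse through the wrapper script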

Thursday, September 20, 2012

Solr Query

The default query matching all documents is *:*. A colon separates a field name from its value, e.g.

keywords:web

Using AND (&&) and OR

content:OWL OR title:"Semantic Web"
content:OWL && title:"Semantic Web"
content:OWL AND title:"Semantic Web"

(multi-word values must be quoted; otherwise only the first word is matched against the named field and the rest are searched in the default field)

Nested Queries

using _query_

content:"semantic" AND _query_:"alcategory:computer_internet" AND _query_:"title:web" AND _query_:"keywords:data"

source: http://searchhub.org/dev/2009/03/31/nested-queries-in-solr/

Wednesday, September 19, 2012

Crawling PDF using Apache Nutch

Nutch parses PDF documents with the parse-tika plugin (Apache Tika, which uses PDFBox for PDFs). The plugin is located in /plugins/parse-tika

in nutch-site.xml

add:
<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>


testing using url: http://www.master.netseven.it/files/262-Nutch.pdf

note:
PDF files converted from Microsoft Word may be parsed into unreadable symbols. Example URL: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf
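To test how a single PDF parses without running a full crawl, recent Nutch 1.x builds ship a parsechecker tool (run bin/nutch with no arguments to see whether yours includes it):

bin/nutch parsechecker http://www.master.netseven.it/files/262-Nutch.pdf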

Delete Documents from Solr index


To delete all documents from the Solr index, enter the following in a browser:

http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/update?stream.body=<commit/>
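The same deletion from the command line, posting to the same update handler (commit=true folds in the commit step):

curl "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary '<delete><query>*:*</query></delete>'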

Friday, September 14, 2012

Nutch, Solr, UIMA integration




The UIMA-generated fields may not be added to indexed documents; in that case, make the following changes in solrconfig.xml:

<updateRequestProcessorChain name="uima" default="true">

<lst name="analyzeFields">
  <bool name="merge">false</bool>
  <arr name="fields">
    <str>content</str>
  </arr>
</lst>
See this link for the UIMA AlchemyAPI annotator: http://uima.apache.org/d/uima-addons-current/AlchemyAPIAnnotator/AlchemyAPIAnnotatorUserGuide.html

Install Apache Nutch 1.5 and Solr 3.6.0 on Ubuntu 12



1. install the JDK
sudo apt-get install openjdk-7-jdk

2. Download and unpack Solr

mkdir -p ~/tmp/solr
cd ~/tmp/solr
wget http://mirror.lividpenguin.com/pub/apache/lucene/solr/3.6.0/apache-solr-3.6.0.tgz
tar -xzvf apache-solr-3.6.0.tgz
Solr ships with an embedded Jetty server; start it with java -jar start.jar from apache-solr-3.6.0/example (shut down with Ctrl-C), then

check http://localhost:8983/solr

3. Download and unpack Nutch

mkdir -p ~/tmp/nutch
cd ~/tmp/nutch
wget  http://mirror.rmg.io/apache/nutch/1.5/apache-nutch-1.5-bin.tar.gz
tar -xzvf apache-nutch-1.5-bin.tar.gz

4. configure Nutch
cd into the extracted Nutch directory, then:
chmod +x bin/nutch
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
add to conf/nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>


mkdir -p urls
cd urls
touch seed.txt
nano seed.txt
add URLs to crawl, one per line, for example:

http://nutch.apache.org/


edit conf/regex-urlfilter.txt and replace

# accept anything else
+.

with a regular expression matching the domain you wish to crawl. For example, to limit the crawl to the nutch.apache.org domain, the line should read:
 +^http://([a-z0-9]*\.)*nutch.apache.org/

5. configure Solr
the config files are in ~/tmp/solr/apache-solr-3.6.0/example/solr/conf
in schema.xml, add the following field type and fields:

<fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>


<field name="digest" type="text" stored="true" indexed="true"/>
<field name="boost" type="text" stored="true" indexed="true"/>
<field name="segment" type="text" stored="true" indexed="true"/>
<field name="host" type="text" stored="true" indexed="true"/>
<field name="site" type="text" stored="true" indexed="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="tstamp" type="text" stored="true" indexed="false"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="anchor" type="text" stored="true" indexed="false" multiValued="true"/>

change <uniqueKey>id</uniqueKey> to
<uniqueKey>url</uniqueKey>

in solrconfig.xml add
<requestHandler name="/nutch" class="solr.SearchHandler" >
    <lst name="defaults">
       <str name="defType">dismax</str>
       <str name="echoParams">explicit</str>
       <float name="tie">0.01</float>
       <str name="qf">
         content^0.5 anchor^1.0 title^1.2
       </str>
       <str name="pf">
         content^0.5 anchor^1.5 title^1.2 site^1.5
       </str>
       <str name="fl">
       url
       </str>
       <int name="ps">100</int>
       <bool name="hl">true</bool>
       <str name="q.alt">*:*</str>
<str name="hl.fl">title url content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
</lst>
</requestHandler>
   
6. run Nutch crawler and index in Solr (make sure Solr has started)

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
 
check the indexed documents at http://localhost:8983/solr
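The /nutch request handler configured in step 5 can also be queried directly (assuming the example Jetty instance is still running):

curl "http://localhost:8983/solr/nutch?q=nutch&wt=json"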
 
ignore the SolrDeleteDuplicates error at the end of the Nutch console output; the documents have already been indexed:
SolrIndexer: starting at 2012-09-14 10:37:49
Indexing 11 documents
SolrIndexer: finished at 2012-09-14 10:38:36, elapsed: 00:00:46
SolrDeleteDuplicates: starting at 2012-09-14 10:38:36
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)