Tuesday, September 25, 2012

Nutch Gora HBase Solr

1. Download Nutch 2.0.
2. Download HBase version 0.90.6; Nutch 2.0 doesn't work with higher versions.
 Follow the installation instructions at http://hbase.apache.org/book/quickstart.html
 Note: if you get a proxy interface setting error, replace localhost/127.0.0.1 with the machine's IP address in /etc/hosts; if running in a virtual machine, use the IP address of the virtual host instead.
3. Follow the tutorial at http://wiki.apache.org/nutch/Nutch2Tutorial and edit the files below.
nutch-site.xml


<property>
 <name>http.agent.name</name>
 <value>Spider</value>
</property>
<property>
 <name>http.robots.agents</name>
 <value>Spider,*</value>
</property>

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

ivy/ivy.xml

<!-- Uncomment this to use HBase as Gora backend. -->
    
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

gora.properties


gora.datastore.default=org.apache.gora.hbase.store.HBaseStore


4. Build with ant (the Nutch2Tutorial uses "ant runtime").
    The build output is in the runtime/ directory.
5. Follow the Nutch 1.5 tutorial's instructions for Solr.

bin/nutch crawl urls -depth 3 -topN 5 (this stores the results in HBase but does not index them in Solr)
To index in Solr: bin/nutch solrindex http://localhost:8983/solr -reindex
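
The crawl command bundles several steps that can also be run one at a time. The following is a minimal sketch of a single crawl round, assuming the Nutch 2.x subcommands inject, generate, fetch, parse, updatedb and solrindex, and that -all selects every generated batch. The run helper, the crawl_round function and the DRY_RUN variable are illustrative names, not part of Nutch. By default the script only prints the commands; set DRY_RUN=0 and run it from the directory that contains bin/nutch to actually execute them.

```shell
#!/bin/sh
# Sketch: one crawl round as separate Nutch 2.x steps.
# DRY_RUN, run() and crawl_round() are hypothetical helpers for illustration.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Arguments: seed directory, topN, Solr URL.
crawl_round() {
  seed_dir="$1"; top_n="$2"; solr_url="$3"
  run bin/nutch inject "$seed_dir" &&
  run bin/nutch generate -topN "$top_n" &&
  run bin/nutch fetch -all &&
  run bin/nutch parse -all &&
  run bin/nutch updatedb &&
  run bin/nutch solrindex "$solr_url" -reindex
}

crawl_round urls 5 http://localhost:8983/solr
```

This covers one round only; the -depth 3 option of the crawl command corresponds to repeating the generate, fetch, parse and updatedb steps three times after a single inject.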




6. After running Nutch, all fetched pages are stored in the HBase table "webpage". To inspect it:
   bin/hbase shell
   hbase(main):002:0> scan "webpage"




1 comment:

  1. I have set up Nutch, Solr, and HBase. I'm unable to use the crawl script as it generates an error, so I have used the step-by-step commands.
    1 - The inject, generate, and parse commands are successful.
    2 - I am able to see the table/data in HBase and verify that the page has been fetched and parsed.
    3 - When I run solrindex, it doesn't load any documents. So I'm missing a step in configuring Solr with Nutch/HBase.
    4 - The Solr log doesn't show any errors.

    Does anyone have any advice?
