2. Download HBase 0.90.6; Nutch 2.0 does not work with later versions.
Follow the installation instructions at http://hbase.apache.org/book/quickstart.html
Note: if you get a proxy interface setting error, replace localhost/127.0.0.1 in /etc/hosts with the machine's IP address; if HBase runs in a virtual machine, use the IP address of the virtual host.
3. Follow the tutorial at http://wiki.apache.org/nutch/Nutch2Tutorial
nutch-site.xml (add these properties inside the <configuration> element)
<property>
<name>http.agent.name</name>
<value>Spider</value>
</property>
<property>
<name>http.robots.agents</name>
<value>Spider,*</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
ivy/ivy.xml
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
4. Build Nutch with ant.
The build output is placed in the runtime/ directory under the Nutch root.
5. Follow the Nutch 1.5 and Solr instructions to crawl and index.
bin/nutch crawl urls -depth 3 -topN 5
(this stores the results in HBase but does not index them in Solr)
To index in Solr:
bin/nutch solrindex http://localhost:8983/solr -reindex
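If the one-shot crawl command fails, the same work can be run as individual jobs. The sketch below is one way to script that, assuming it is run from the Nutch runtime directory with seed URLs in ./urls and Solr at the URL used above. The helper names (run, crawl_round, crawl) and the DRY_RUN switch are made up for illustration, and the job arguments follow the Nutch 2.0 usage messages as I understand them; later 2.x releases may expect different arguments (for example for updatedb), so check the usage output of each job first.

```shell
#!/bin/sh
# Sketch: run a Nutch 2.x crawl as separate jobs, then index into Solr.
# Assumptions: current directory is the Nutch runtime directory, seed URLs
# are in ./urls, and Solr listens on http://localhost:8983/solr.
NUTCH="${NUTCH:-bin/nutch}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print each command; execute it unless DRY_RUN is set.
run() {
  echo "+ $*"
  [ -n "$DRY_RUN" ] || "$@"
}

# One generate/fetch/parse/update cycle (roughly one level of -depth).
crawl_round() {
  run "$NUTCH" generate -topN 5 &&
  run "$NUTCH" fetch -all &&
  run "$NUTCH" parse -all &&
  run "$NUTCH" updatedb
}

# Inject the seeds once, run the requested number of rounds, then index.
crawl() {
  rounds="${1:-3}"
  run "$NUTCH" inject urls || return 1
  i=0
  while [ "$i" -lt "$rounds" ]; do
    crawl_round || return 1
    i=$((i + 1))
  done
  run "$NUTCH" solrindex "$SOLR_URL" -reindex
}
```

Calling crawl 3 should roughly correspond to -depth 3 above; DRY_RUN=1 crawl 3 only prints the commands, which is handy for checking the sequence before running it.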
6. After Nutch has run, all crawled data is stored in the HBase table webpage:
bin/hbase shell
hbase(main):002:0> scan "webpage"
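A full scan prints every row, which gets noisy after a real crawl. As a lighter check, the HBase shell can be fed a few commands from standard input. This is only a sketch: it assumes the current directory is the HBase install directory and that the table is named webpage as above; the inspect_webpage function and the HBASE_SHELL variable are hypothetical names introduced here.

```shell
#!/bin/sh
# Sketch: non-interactive sanity check of the Nutch table in HBase.
# Assumption: run from the HBase install directory; table name is 'webpage'.
HBASE_SHELL="${HBASE_SHELL:-bin/hbase shell}"

inspect_webpage() {
  # list: show all tables; count: number of rows; scan with LIMIT: one sample row.
  $HBASE_SHELL <<'EOF'
list
count 'webpage'
scan 'webpage', {LIMIT => 1}
EOF
}
```

Run inspect_webpage after a crawl; a row count of zero would suggest that nothing was stored.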
I have set up Nutch, Solr, and HBase. I'm unable to use the crawl script because it generates an error, so I have run the step-by-step commands instead.
1 - The inject, generate, and parse commands are successful.
2 - I am able to see the table/data in HBase and verify that the page has been fetched and parsed.
3 - When I run solrindex, it doesn't load any documents, so I'm missing a step in configuring Solr with Nutch/HBase.
4 - The Solr log doesn't show any errors.
Does anyone have any advice?