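Nutch crawls PDF documents through the parse-tika plugin, which uses Apache Tika (and PDFBox underneath) to extract the text. The plugin is located in /plugins/parse-tika
In nutch-site.xml, add the following. A content limit of -1 disables truncation, so large PDFs are downloaded whole, and plugin.includes must list parse-tika: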
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
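The plugin.includes value is a regular expression that Nutch matches against plugin ids, so a quick sanity check can save a failed crawl. The following is only a minimal sketch in Python, not part of Nutch: the embedded XML is a shortened copy of the settings above wrapped in the usual <configuration> root element, the helper names load_properties and plugin_enabled are made up for illustration, and it assumes the whole plugin id has to match the expression.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the nutch-site.xml settings above (assumption: the
# usual <configuration> root element wraps the <property> entries).
NUTCH_SITE = """<configuration>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)</value>
</property>
</configuration>"""


def load_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.iter("property")
    }


def plugin_enabled(includes, plugin_id):
    """True if the whole plugin id matches the plugin.includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


props = load_properties(NUTCH_SITE)
print("parse-tika enabled:", plugin_enabled(props["plugin.includes"], "parse-tika"))
print("content limit:", props["http.content.limit"])
```

To check a real installation, read the text of conf/nutch-site.xml from disk and pass it to load_properties instead of the embedded string.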
Tested using the URL: http://www.master.netseven.it/files/262-Nutch.pdf
Note:
PDF files converted from Microsoft Word are parsed with unreadable symbols. Example URL: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf