Caterpillar Crawling: Crawling PDF using Apache Nutch

Wednesday, September 19, 2012

Crawling PDF using Apache Nutch

Nutch parses PDF documents with the parse-tika plugin, which uses Apache Tika (and, through it, PDFBox) to extract the text. The plugin is located in /plugins/parse-tika

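In nutch-site.xml, add the two properties below. A content limit of -1 disables truncation, so large PDF files are downloaded in full instead of being cut off at the default size limit: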
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
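
A quick way to sanity-check the configuration before starting a crawl is sketched below. It is only a rough sketch, not part of Nutch: the helper names `read_properties` and `plugin_enabled` and the `SAMPLE` text are made up for illustration. It assumes the size limit is the `http.content.limit` property from nutch-default.xml, and that Nutch treats `plugin.includes` as a regular expression that must match a whole plugin id.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the two properties from this post.
SAMPLE = """<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the config text."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def plugin_enabled(props, plugin_id):
    """Check whether plugin_id is fully matched by the plugin.includes regex.

    Whitespace is stripped from the pattern first, so a value that was
    wrapped across several lines still works.
    """
    pattern = "".join(props.get("plugin.includes", "").split())
    return bool(pattern) and re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    props = read_properties(SAMPLE)
    print("content limit:", props.get("http.content.limit"))
    print("parse-tika enabled:", plugin_enabled(props, "parse-tika"))
```

To check a real installation, read the text of conf/nutch-site.xml from disk and pass it to `read_properties` in place of `SAMPLE`.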

Tested using the URL: http://www.master.netseven.it/files/262-Nutch.pdf

Note:
PDF files that were converted from Microsoft Word are parsed into unreadable symbols. Example URL: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf
