Wednesday, September 19, 2012

Crawling PDF using Apache Nutch

Nutch parses PDF documents with the parse-tika plugin, which uses PDFBox internally. The plugin is located in /plugins/parse-tika

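In nutch-site.xml, add the following. Setting http.content.limit to -1 removes the download size limit (65536 bytes by default), so large PDFs are not truncated before parsing:

<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>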

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
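A quick way to check these settings before starting a crawl is a small script. The following is only a sketch, not part of Nutch: the helper names and the sample config are made up for illustration, and it assumes that the size-limit property is named http.content.limit (as in nutch-default.xml) and that plugin.includes is a regular expression matched against whole plugin ids.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample config, shortened from the properties in this post.
SAMPLE_NUTCH_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name:
            props[name] = value
    return props


def check_pdf_settings(xml_text):
    """Return a list of problems that would likely break PDF parsing."""
    props = read_properties(xml_text)
    problems = []

    # Assumption: -1 means "no limit"; any other value may truncate PDFs.
    if props.get("http.content.limit") != "-1":
        problems.append("http.content.limit is not -1")

    # Assumption: the value is a regex that must match the full plugin id.
    includes = props.get("plugin.includes", "")
    if not includes or re.fullmatch(includes, "parse-tika") is None:
        problems.append("plugin.includes does not match parse-tika")

    return problems


if __name__ == "__main__":
    for problem in check_pdf_settings(SAMPLE_NUTCH_SITE):
        print("WARNING:", problem)
```

To check a real installation, read the text of conf/nutch-site.xml and pass it to check_pdf_settings instead of the sample string. An empty list means both settings look right.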


Tested using the URL: http://www.master.netseven.it/files/262-Nutch.pdf

Note:
PDF files converted from Microsoft Word were parsed into unreadable symbols. Example URL: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf
