Caterpillar Crawling: Building a Java Application with Nutch 2.0 and Solr 3.6

Apache Nutch 2.0 source @ ~/tmp/apache-nutch-2.0-src NUTCH_HOME

Apache Solr 3.6 binary @ ~/tmp/apache-solr-3.6.0 SOLR_HOME

Apache HBase @ ~/tmp/hbase-0.90.6 HBASE_HOME

1. Preparing libraries in Eclipse

Go to SOLR_HOME/dist and open apache-solr-3.6.0.war with your favorite archive manager. Inside the archive, go to /WEB-INF/lib/ and extract everything there to SOLR_HOME/dist. This gives us all the libraries we need for our Java application in one place.
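
If you would rather script this step than use an archive manager, here is a minimal sketch using only the JDK's zip support. The class name WarLibExtractor and its method are made up for illustration; a .war file is an ordinary zip archive, so nothing here is specific to Solr.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Solr or Nutch.
public class WarLibExtractor {

    private static final String LIB_PREFIX = "WEB-INF/lib/";

    /** Copies every file under WEB-INF/lib/ in the war into targetDir and returns the count. */
    public static int extractLibs(Path war, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int copied = 0;
        try (ZipFile zip = new ZipFile(war.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(LIB_PREFIX)) {
                    continue;
                }
                // Keep only the file name so the jars land flat in targetDir.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, targetDir.resolve(fileName),
                            StandardCopyOption.REPLACE_EXISTING);
                }
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to apache-solr-3.6.0.war, args[1]: folder to extract into
        Path war = Paths.get(args[0]);
        Path target = Paths.get(args[1]);
        System.out.println("Extracted " + extractLibs(war, target) + " files to " + target);
    }
}
```

Run it with the path to the war as the first argument and SOLR_HOME/dist as the second.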

2. Creating an Eclipse project

Create a Java project with a class called MyCrawler in the package websearch.crawler.

Add a lib folder to the project, then add two subfolders: nutch and solr. Copy everything from NUTCH_HOME/lib to lib/nutch and everything from SOLR_HOME/dist to lib/solr, and add the jars to the project's build path.

Add a plugins folder to the project and copy everything from NUTCH_HOME/plugins into it.

Add a urls folder and create a seed.txt file in it. Put your seed URLs there, one per line.
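
The seed file can also be generated from code, which is handy if the seed list comes from somewhere else. This is only a sketch: the class SeedFileWriter is hypothetical and the two URLs are placeholders, so swap in your own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
public class SeedFileWriter {

    /** Creates urlsDir if needed and writes one seed URL per line to seed.txt. */
    public static Path writeSeeds(Path urlsDir, List<String> seeds) throws IOException {
        Files.createDirectories(urlsDir);
        Path seedFile = urlsDir.resolve("seed.txt");
        Files.write(seedFile, seeds, StandardCharsets.UTF_8);
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder seeds; replace with the sites you want to crawl.
        List<String> seeds = Arrays.asList(
                "http://nutch.apache.org/",
                "http://lucene.apache.org/solr/");
        System.out.println("Wrote " + writeSeeds(Paths.get("urls"), seeds));
    }
}
```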

Create two folders somewhere on the file system, nutchconf and solrconf, and copy all the files from NUTCH_HOME/conf and SOLR_HOME/example/solr/conf into them, respectively.

In the Eclipse project properties, under Java Build Path, click Add Class Folder and add the nutchconf and solrconf folders. Then go to Order and Export, find the entries for nutchconf and solrconf, and move them to the top.
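
A quick way to confirm that the class folders are really on the classpath is to ask the class loader where it finds the configuration files. The sketch below is an assumption about what is worth probing: the file names are ones that normally ship in the Nutch and Solr conf folders, and the class name ConfOnClasspathCheck is made up.

```java
import java.net.URL;

// Hypothetical helper for checking the Eclipse build path setup.
public class ConfOnClasspathCheck {

    /** Returns where the class loader finds the resource, or null if it is not on the classpath. */
    public static URL locate(String resourceName) {
        return ConfOnClasspathCheck.class.getClassLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        // File names assumed to exist in nutchconf and solrconf; adjust to your setup.
        String[] names = {"nutch-default.xml", "nutch-site.xml", "gora.properties", "solrconfig.xml"};
        for (String name : names) {
            URL location = locate(name);
            System.out.println(name + " -> "
                    + (location == null ? "NOT FOUND on classpath" : location));
        }
    }
}
```

Run it from the same Eclipse project. If a file is reported from an unexpected location, such as inside a jar, the Order and Export ordering is the first thing to check.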

On the build path, replace lucene-core-3.4.0.jar with lucene-core-3.6.0.jar so that the Lucene version matches Solr 3.6.

3. Adding code to MyCrawler

 package websearch.crawler;

 import java.util.StringTokenizer;

 import org.apache.hadoop.util.ToolRunner;
 import org.apache.nutch.crawl.Crawler;
 import org.apache.nutch.indexer.solr.SolrIndexerJob;
 import org.apache.nutch.util.NutchConfiguration;
 import org.apache.solr.client.solrj.SolrQuery;
 import org.apache.solr.client.solrj.SolrServerException;
 import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
 import org.apache.solr.client.solrj.response.QueryResponse;
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;

 public class MyCrawler {

     /**
      * Crawls the seed URLs with Nutch, indexes the results into Solr,
      * then runs a sample query and prints the hits.
      *
      * @param args unused
      * @throws Exception if the Solr URL is malformed
      */
     public static void main(String[] args) throws Exception {
         // Run the Nutch crawl tool on the urls folder
         String crawlArg = "urls -depth 3 -topN 5";
         try {
             ToolRunner.run(NutchConfiguration.create(), new Crawler(),
                     tokenize(crawlArg));
         } catch (Exception e) {
             e.printStackTrace();
             return;
         }

         // Run the Solr index tool
         String indexArg = "http://localhost:8983/solr -reindex";
         try {
             ToolRunner.run(NutchConfiguration.create(), new SolrIndexerJob(),
                     tokenize(indexArg));
         } catch (Exception e) {
             e.printStackTrace();
             return;
         }

         // Let's query for something!
         String url = "http://localhost:8983/solr";
         CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
         SolrQuery query = new SolrQuery();
         query.setQuery("content:mycontent"); // Search for "mycontent" in the content field
         // Solr sorts reliably only on single-valued, untokenized fields;
         // pick such a field if this sort fails with your schema.
         query.addSortField("content", SolrQuery.ORDER.asc);
         QueryResponse rsp;
         try {
             rsp = server.query(query);
         } catch (SolrServerException e) {
             e.printStackTrace();
             return;
         }

         // Display the results in the console
         SolrDocumentList docs = rsp.getResults();
         for (SolrDocument doc : docs) {
             // Concatenation prints "null" for a missing field instead of throwing
             System.out.println(doc.getFieldValue("title") + " Link: "
                     + doc.getFieldValue("url"));
         }
     }

     /**
      * Helper function to convert a string into an array of strings by
      * splitting it on whitespace.
      *
      * @param str
      *            string to be tokenized
      * @return an array of strings containing one word each
      */
     public static String[] tokenize(String str) {
         StringTokenizer tok = new StringTokenizer(str);
         String[] tokens = new String[tok.countTokens()];
         int i = 0;
         while (tok.hasMoreTokens()) {
             tokens[i] = tok.nextToken();
             i++;
         }
         return tokens;
     }
 }
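
To see what the crawl and index tools actually receive as arguments, here is a small standalone sketch that splits the same two argument strings with String.split instead of StringTokenizer. For these inputs it should give the same arrays as the tokenize helper; the class name ArgSplitDemo is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; uses only the JDK.
public class ArgSplitDemo {

    /** Splits a command-line style string on runs of whitespace. */
    public static String[] splitArgs(String commandLine) {
        String trimmed = commandLine.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Prints [urls, -depth, 3, -topN, 5]
        System.out.println(Arrays.toString(splitArgs("urls -depth 3 -topN 5")));
        // Prints [http://localhost:8983/solr, -reindex]
        System.out.println(Arrays.toString(splitArgs("http://localhost:8983/solr -reindex")));
    }
}
```

Neither approach understands quoting, so a seed folder with spaces in its name would be split into several arguments. In that case, build the String[] by hand.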

4. Run

Start HBase: HBASE_HOME/bin/start-hbase.sh
Start Solr: from SOLR_HOME/example, run java -jar start.jar
Run MyCrawler as a Java application in Eclipse.
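
Since the crawl takes a while, it can be worth checking that Solr is reachable before starting it. The sketch below does a plain HTTP GET with the JDK. It assumes Solr listens on localhost:8983 as in the code above and that the /admin/ping handler from the Solr example configuration is enabled; the class name SolrPing is made up.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper; not part of SolrJ.
public class SolrPing {

    /** Returns true if an HTTP GET on pingUrl answers with status 200 within the timeout. */
    public static boolean isUp(String pingUrl, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(pingUrl).openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestMethod("GET");
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException e) {
            // Connection refused, timeout, malformed URL: treat all as "not up".
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Assumed URL: the ping handler of the Solr example server.
        String pingUrl = "http://localhost:8983/solr/admin/ping";
        System.out.println(isUp(pingUrl, 2000)
                ? "Solr is up"
                : "Solr is not reachable at " + pingUrl);
    }
}
```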

Other resources:

Building a Java application with Apache Nutch and Solr
Solr Wiki: http://wiki.apache.org/solr/Solrj
http://lucidworks.lucidimagination.com/display/solr/Using+SolrJ
regex checker: http://regexpal.com/

Monday, October 1, 2012
