Apache Nutch 2.0 source @ ~/tmp/apache-nutch-2.0-src NUTCH_HOME
Apache Solr 3.6 binary SOLR_HOME
Apache HBase @ ~/tmp/hbase-0.90.6 HBASE_HOME
1. Preparing libraries in Eclipse
Go to SOLR_HOME/dist and open apache-solr-3.6.0.war with your favorite archive manager. Navigate to /WEB-INF/lib/ inside the archive and extract everything there into SOLR_HOME/dist. This puts all the libraries we need for our Java application in one place.
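If you prefer the command line, the standard unzip tool does the same thing (a sketch, assuming the paths above):

cd SOLR_HOME/dist
unzip -j apache-solr-3.6.0.war 'WEB-INF/lib/*'

The -j flag strips the WEB-INF/lib/ prefix so the jars land directly in dist.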
2. Creating an Eclipse project
Create a Java project with a class called MyCrawler.
Add a lib folder into the project, then add two subfolders: nutch and solr. Copy everything from NUTCH_HOME/lib into lib/nutch and everything from SOLR_HOME/dist into lib/solr.
Add a plugins folder into the project and copy everything from NUTCH_HOME/plugins into it.
Add a urls folder and create a seed.txt file inside it. This is where your seed URLs go, one per line, as in the example below.
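A minimal seed.txt might look like this (the URL is just a placeholder; start from whatever site you want to crawl):

http://nutch.apache.org/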
Create two folders somewhere on the file system: nutchconf and solrconf, and copy all the files from NUTCH_HOME/conf and SOLR_HOME/example/solr/conf to them respectively.
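Before the first run, check nutch-site.xml in nutchconf: Nutch refuses to crawl without an agent name, and Nutch 2.0 needs to be told to persist its data in HBase via Gora. A minimal sketch of the two properties to add inside the <configuration> element (the agent name here is an arbitrary example):

<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>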
In Eclipse, go to project properties -> Java Build Path, click Add Class Folder and add the nutchconf and solrconf folders. Then go to Order and Export, find the entries for nutchconf and solrconf and move them to the top.
Replace lucene-core-3.4.0.jar with lucene-core-3.6.0.jar so the Lucene version matches Solr 3.6.
3. Adding code into MyCrawler
The class below does three things in sequence: it runs the Nutch crawl over the seed URLs, pushes the crawled pages into the Solr index with SolrIndexerJob, and then queries the index through SolrJ.
package websearch.crawler;

import java.util.StringTokenizer;

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Crawler;
import org.apache.nutch.indexer.solr.SolrIndexerJob;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class MyCrawler {

    public static void main(String[] args) throws Exception {
        // Crawl the seeds in the urls folder, 3 levels deep,
        // following at most 5 links per level.
        String crawlArg = "urls -depth 3 -topN 5";
        try {
            ToolRunner.run(NutchConfiguration.create(), new Crawler(),
                    tokenize(crawlArg));
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }

        // Push everything that was crawled into the Solr index.
        String indexArg = "http://localhost:8983/solr -reindex";
        try {
            ToolRunner.run(NutchConfiguration.create(), new SolrIndexerJob(),
                    tokenize(indexArg));
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }

        // Let's query for something!
        String url = "http://localhost:8983/solr";
        CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
        SolrQuery query = new SolrQuery();
        query.setQuery("content:mycontent"); // search for "mycontent" in the content field
        query.addSortField("content", SolrQuery.ORDER.asc);
        QueryResponse rsp;
        try {
            rsp = server.query(query);
        } catch (SolrServerException e) {
            e.printStackTrace();
            return;
        }

        // Display the results in the console.
        SolrDocumentList docs = rsp.getResults();
        for (int i = 0; i < docs.size(); i++) {
            System.out.println(docs.get(i).get("title").toString() + " Link: "
                    + docs.get(i).get("url").toString());
        }
    }

    /**
     * Helper that splits a string into an array of whitespace-separated tokens.
     *
     * @param str the string to tokenize
     * @return an array containing one entry per token
     */
    public static String[] tokenize(String str) {
        StringTokenizer tok = new StringTokenizer(str);
        String[] tokens = new String[tok.countTokens()];
        int i = 0;
        while (tok.hasMoreTokens()) {
            tokens[i++] = tok.nextToken();
        }
        return tokens;
    }
}
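A side note on the SolrJ client: CommonsHttpSolrServer is deprecated as of Solr 3.6 in favor of HttpSolrServer. If you want to stay off the deprecated API, the query part can be written like this instead (a minimal sketch, with the same assumptions about the index fields as above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class QueryExample {
    public static void main(String[] args) throws Exception {
        // HttpSolrServer replaces CommonsHttpSolrServer in SolrJ 3.6+.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("content:mycontent");
        QueryResponse rsp = server.query(query);
        SolrDocumentList docs = rsp.getResults();
        for (SolrDocument doc : docs) {
            System.out.println(doc.get("title") + " Link: " + doc.get("url"));
        }
    }
}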
4. Run.
Start HBase: HBASE_HOME/bin/start-hbase.sh
Start Solr: cd SOLR_HOME/example and run java -jar start.jar
Run MyCrawler as a Java application from Eclipse.
Other resources:
- Building a Java application with Apache Nutch and Solr
- Solr Wiki: http://wiki.apache.org/solr/Solrj
- http://lucidworks.lucidimagination.com/display/solr/Using+SolrJ
- regex checker: http://regexpal.com/