Tuesday, October 30, 2012

Ubuntu Server 12.04 LTS, Eucalyptus and Xen

In this installation, 2 Dell machines (Dual Core, 4GB RAM, 250 GB hard disk) and 1 IBM x3650 machine are used. 

1. Install Eucalyptus cloud and cluster controller (Dell)

1.1. Ubuntu Server 12.04 LTS 
  • default installation with the OpenSSH server selected
  • make sure the network is configured correctly
1.2 Install postfix, which Eucalyptus uses to send confirmation emails to newly registered users.
1.3 Install Eucalyptus
1.3.1 download and add pub key
  • sudo apt-key add c1240596-eucalyptus-release-key.pub
 
1.3.2 add deb
Create a file in /etc/apt/sources.list.d called eucalyptus.list with the following content:


  • deb http://downloads.eucalyptus.com/software/eucalyptus/3.1/ubuntu precise main

On all machines that will run either Eucalyptus or Euca2ools, create a file in /etc/apt/sources.list.d called euca2ools.list with the following content:
  • deb http://downloads.eucalyptus.com/software/euca2ools/2.1/ubuntu precise main
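The two repository files from step 1.3.2 can also be written from a script. The sketch below is a dry run under one assumption that is not in the original: it writes into the directory named by a hypothetical APT_SOURCES_DIR variable (a scratch directory by default) instead of /etc/apt/sources.list.d, so it runs without root.

```shell
#!/bin/sh
# Dry-run sketch: write the two apt source files from step 1.3.2.
# APT_SOURCES_DIR is a hypothetical variable; point it at
# /etc/apt/sources.list.d (and run as root) for a real install.
APT_SOURCES_DIR="${APT_SOURCES_DIR:-$(mktemp -d)}"

echo "deb http://downloads.eucalyptus.com/software/eucalyptus/3.1/ubuntu precise main" \
    > "$APT_SOURCES_DIR/eucalyptus.list"
echo "deb http://downloads.eucalyptus.com/software/euca2ools/2.1/ubuntu precise main" \
    > "$APT_SOURCES_DIR/euca2ools.list"

# Show what was written.
cat "$APT_SOURCES_DIR/eucalyptus.list" "$APT_SOURCES_DIR/euca2ools.list"
```

Only the client machine needs euca2ools.list alone; the controller machines need both files.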
1.3.3 install
  • sudo apt-get update  
  • sudo apt-get install eucalyptus-cloud eucalyptus-cc eucalyptus-walrus
1.3.4 start
  • /usr/sbin/euca_conf --initialize (for first time)
    sudo service eucalyptus-cloud start
    sudo service eucalyptus-cc start 

1.3.5 verify
  • check in a web browser: https://<ip>:8443
  • log in as admin/admin; the first login lets you change the password and set an email address


2. Install Eucalyptus node controller (IBM)

2.1 Repeat 1.1, 1.3.1, 1.3.2
2.2 install Xen
  • sudo apt-get install iproute  iptables module-init-tools python2.5 python2.6
  • sudo apt-get  install xen-utils
2.3 Modify GRUB to boot Xen by default
sudo sed -i 's/GRUB_DEFAULT=.*\+/GRUB_DEFAULT="Xen 4.1-amd64"/' /etc/default/grub
sudo update-grub
 
Set the default toolstack to xm (aka xend):
sudo sed -i 's/TOOLSTACK=.*\+/TOOLSTACK="xm"/' /etc/default/xen
Now reboot:
sudo reboot
And then verify that the installation has succeeded: 
sudo xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 945 1 r----- 11.3
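Before editing the real files, the two substitutions from step 2.3 can be rehearsed on scratch copies. This is a minimal sketch: the sample file contents are made up, and the sed patterns are a simplified variant (anchored on the variable name) rather than the exact expressions used above.

```shell
#!/bin/sh
# Rehearse the step 2.3 edits on scratch copies; the real files under
# /etc/default are not touched. The sample file contents are made up.
work="$(mktemp -d)"
printf 'GRUB_DEFAULT=0\nGRUB_TIMEOUT=2\n' > "$work/grub"
printf 'TOOLSTACK=\n' > "$work/xen"

# Simplified patterns (assumption): anchor on the variable name and
# replace the rest of the line.
sed 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Xen 4.1-amd64"/' "$work/grub" > "$work/grub.new"
sed 's/^TOOLSTACK=.*/TOOLSTACK="xm"/' "$work/xen" > "$work/xen.new"

grub_line="$(grep '^GRUB_DEFAULT=' "$work/grub.new")"
xen_line="$(grep '^TOOLSTACK=' "$work/xen.new")"
echo "$grub_line"
echo "$xen_line"
```

Writing to a .new file instead of editing in place makes it easy to diff the result against the original before copying it back.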

2.4 Network configuration
Check whether eth0 or eth1 is in use. The following assumes eth0 with DHCP.

sudo apt-get install bridge-utils
 
Edit /etc/network/interfaces, and add this:  

auto xenbr0
iface xenbr0 inet dhcp
    bridge_ports eth0
 
Restart networking to enable xenbr0 bridge: 
 
sudo service networking restart
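If the bridge setup is scripted, it helps to make the edit idempotent so that rerunning it does not duplicate the stanza. The sketch below assumes a hypothetical IFACES_FILE variable that defaults to a scratch file; set it to /etc/network/interfaces (as root) to apply it for real.

```shell
#!/bin/sh
# Append the xenbr0 stanza only if it is not already present.
# IFACES_FILE is a hypothetical variable; it defaults to a scratch file
# so the sketch can run without touching /etc/network/interfaces.
IFACES_FILE="${IFACES_FILE:-$(mktemp)}"

add_bridge() {
    if grep -q '^iface xenbr0 ' "$IFACES_FILE"; then
        echo "xenbr0 already configured"
    else
        printf '\nauto xenbr0\niface xenbr0 inet dhcp\n    bridge_ports eth0\n' >> "$IFACES_FILE"
        echo "xenbr0 added"
    fi
}

add_bridge
add_bridge   # the second call changes nothing
```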


 
2.5 install Eucalyptus node controller
  • sudo apt-get install eucalyptus-nc
    sudo service eucalyptus-nc start

3. Register

 Register CC (at CLC machine)

sudo /usr/sbin/euca_conf --register-cluster --partition cluster01  --host  10.1.62.178 --component  cc-10.1.62.178
(this may require root access: use sudo passwd to set a password for root)

Register NC (at CC machine)


sudo /usr/sbin/euca_conf --register-nodes "<node0_IP_address> ... <nodeN_IP_address>"

Register walrus at CLC machine

sudo /usr/sbin/euca_conf --register-walrus --partition walrus --host <walrus_IP_address> --component <walrus-hostname>


4. Install Eucalyptus client machine (Dell)

4.1 install Ubuntu Desktop 12.04 LTS
4.2 repeat 1.3.1 and 1.3.2
4.3 install euca2ools
  • sudo apt-get install euca2ools
Apply for a new user account at https://<ip>:8443

Get admin credentials
Log in as eucalyptus/admin at https://<ip>:8443

Download new credentials from admin@eucalyptus and unzip them to ~/.euca:
mkdir ~/.euca
cd ~/.euca
unzip <filepath>/euca2-<user>-x509.zip
source eucarc
<*remember to use admin credential to source*>
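A quick check that the credentials were actually sourced can save a confusing failure later. The sketch below assumes that eucarc exports EC2_URL, S3_URL, EC2_ACCESS_KEY and EC2_SECRET_KEY; that list is an assumption, so adjust it to whatever your eucarc really sets.

```shell
#!/bin/sh
# Report which of the expected credential variables are missing.
# The variable names are an assumption about what eucarc exports.
check_euca_env() {
    missing=""
    for name in EC2_URL S3_URL EC2_ACCESS_KEY EC2_SECRET_KEY; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "credentials look loaded"
}

check_euca_env || echo "run 'source eucarc' in ~/.euca first"
```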

4.4 Add image from eustore
eustore-describe-images
eustore-install-image -b testbucket -i xxxxxxxx -k xen
<*this will take a while*>
The output looks like this:

Downloading Image :  CentOS 5 1.3GB root, Hypervisor-Specific Kernels
0-----1-----2-----3-----4-----5-----6-----7-----8-----9-----10
##############################################################

Checking image bundle
Unbundling image
going to look for kernel dir : xen-kernel
Bundling/uploading kernel
Checking image
Compressing image
Encrypting image
Splitting image...
Part: vmlinuz-2.6.27.21-0.1-xen.part.00
Generating manifest /tmp/RXAdDH/vmlinuz-2.6.27.21-0.1-xen.manifest.xml
Checking bucket: testbucket
Creating bucket: testbucket
Uploading manifest file
Uploading part: vmlinuz-2.6.27.21-0.1-xen.part.00
Uploaded image as testbucket/vmlinuz-2.6.27.21-0.1-xen.manifest.xml
testbucket/vmlinuz-2.6.27.21-0.1-xen.manifest.xml
eki-855C3923
Bundling/uploading ramdisk
Checking image
Compressing image
Encrypting image
Splitting image...
Part: initrd-2.6.27.21-0.1-xen.part.00
Generating manifest /tmp/RXAdDH/initrd-2.6.27.21-0.1-xen.manifest.xml
Checking bucket: testbucket
Uploading manifest file
Uploading part: initrd-2.6.27.21-0.1-xen.part.00
Uploaded image as testbucket/initrd-2.6.27.21-0.1-xen.manifest.xml
testbucket/initrd-2.6.27.21-0.1-xen.manifest.xml
eri-05BE397E
Bundling/uploading image
Checking image
Compressing image
Encrypting image
Splitting image...
Part: euca-centos-5.8-2012.07.05-x86_64.part.00
Part: euca-centos-5.8-2012.07.05-x86_64.part.01
Part: euca-centos-5.8-2012.07.05-x86_64.part.02
Part: euca-centos-5.8-2012.07.05-x86_64.part.03
Part: euca-centos-5.8-2012.07.05-x86_64.part.04
Part: euca-centos-5.8-2012.07.05-x86_64.part.05
Part: euca-centos-5.8-2012.07.05-x86_64.part.06
Part: euca-centos-5.8-2012.07.05-x86_64.part.07
Part: euca-centos-5.8-2012.07.05-x86_64.part.08
Part: euca-centos-5.8-2012.07.05-x86_64.part.09
Part: euca-centos-5.8-2012.07.05-x86_64.part.10
Part: euca-centos-5.8-2012.07.05-x86_64.part.11
Part: euca-centos-5.8-2012.07.05-x86_64.part.12
Part: euca-centos-5.8-2012.07.05-x86_64.part.13
Part: euca-centos-5.8-2012.07.05-x86_64.part.14
Part: euca-centos-5.8-2012.07.05-x86_64.part.15
Part: euca-centos-5.8-2012.07.05-x86_64.part.16
Part: euca-centos-5.8-2012.07.05-x86_64.part.17
Part: euca-centos-5.8-2012.07.05-x86_64.part.18
Part: euca-centos-5.8-2012.07.05-x86_64.part.19
Part: euca-centos-5.8-2012.07.05-x86_64.part.20
Part: euca-centos-5.8-2012.07.05-x86_64.part.21
Part: euca-centos-5.8-2012.07.05-x86_64.part.22
Part: euca-centos-5.8-2012.07.05-x86_64.part.23
Part: euca-centos-5.8-2012.07.05-x86_64.part.24
Part: euca-centos-5.8-2012.07.05-x86_64.part.25
Part: euca-centos-5.8-2012.07.05-x86_64.part.26
Part: euca-centos-5.8-2012.07.05-x86_64.part.27
Part: euca-centos-5.8-2012.07.05-x86_64.part.28
Part: euca-centos-5.8-2012.07.05-x86_64.part.29
Part: euca-centos-5.8-2012.07.05-x86_64.part.30
Part: euca-centos-5.8-2012.07.05-x86_64.part.31
Part: euca-centos-5.8-2012.07.05-x86_64.part.32
Part: euca-centos-5.8-2012.07.05-x86_64.part.33
Part: euca-centos-5.8-2012.07.05-x86_64.part.34
Part: euca-centos-5.8-2012.07.05-x86_64.part.35
Generating manifest /tmp/RXAdDH/euca-centos-5.8-2012.07.05-x86_64.manifest.xml
Checking bucket: testbucket
Uploading manifest file
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.00
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.01
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.02
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.03
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.04
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.05
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.06
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.07
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.08
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.09
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.10
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.11
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.12
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.13
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.14
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.15
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.16
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.17
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.18
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.19
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.20
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.21
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.22
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.23
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.24
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.25
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.26
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.27
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.28
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.29
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.30
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.31
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.32
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.33
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.34
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.35
Uploaded image as testbucket/euca-centos-5.8-2012.07.05-x86_64.manifest.xml
testbucket/euca-centos-5.8-2012.07.05-x86_64.manifest.xml
Installed image: emi-97B83704


Run euca-describe-images to verify emi-97B83704. The output looks like this:
IMAGE    eki-855C3923    testbucket/vmlinuz-2.6.27.21-0.1-xen.manifest.xml    292622667431    available    public        x86_64    kernel                instance-store
IMAGE    emi-97B83704    testbucket/euca-centos-5.8-2012.07.05-x86_64.manifest.xml    292622667431    available    public        x86_64    machine    eki-855C3923    eri-05BE397E        instance-store
IMAGE    eri-05BE397E    testbucket/initrd-2.6.27.21-0.1-xen.manifest.xml    292622667431    available    public        x86_64    ramdisk                instance-store
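When scripting against this listing, the machine image ID can be pulled out with awk. This is a sketch on a saved copy of the three lines above, assuming the second whitespace-separated column is always the image ID.

```shell
#!/bin/sh
# Pull the machine image ID (emi-...) out of euca-describe-images output.
# The sample input is copied from the listing above with the column
# spacing collapsed; only the first two columns matter here.
sample_output='IMAGE eki-855C3923 testbucket/vmlinuz-2.6.27.21-0.1-xen.manifest.xml 292622667431 available public x86_64 kernel instance-store
IMAGE emi-97B83704 testbucket/euca-centos-5.8-2012.07.05-x86_64.manifest.xml 292622667431 available public x86_64 machine eki-855C3923 eri-05BE397E instance-store
IMAGE eri-05BE397E testbucket/initrd-2.6.27.21-0.1-xen.manifest.xml 292622667431 available public x86_64 ramdisk instance-store'

emi_id="$(printf '%s\n' "$sample_output" | awk '$1 == "IMAGE" && $2 ~ /^emi-/ { print $2 }')"
echo "$emi_id"
```

On a live system the same awk filter can read directly from euca-describe-images through a pipe.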


Images can also be viewed at https://<ip>:8443 under Resource management/Images

5. Generate admin credentials


/usr/sbin/euca_conf --get-credentials admin.zip

unzip admin.zip
sudo -s
source eucarc


sources:
1. Xen setting: https://help.ubuntu.com/community/XenProposed
 







Thursday, October 25, 2012

Working on IBM x3650

Network setting in Ubuntu server 12.10

sudo nano /etc/network/interfaces

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp
netmask 255.255.255.0
broadcast 10.1.62.255


auto eth1
iface eth1 inet dhcp
netmask 255.255.255.0
broadcast 10.1.62.255

sudo service networking restart
sudo ifup eth0
sudo ifup eth1

ifconfig -a
This shows the resulting configuration:
eth0 with ip 10.1.62.xxx
eth1 with ip 192.168.2.99

To test connectivity: ping google.com



Monday, October 1, 2012

Building a Java Application with Nutch 2.0 and Solr 3.6

Apache Nutch 2.0 source @ ~/tmp/apache-nutch-2.0-src NUTCH_HOME
Apache Solr 3.6 binary @ ~/tmp/apache-solr-3.6.0 SOLR_HOME
Apache hbase @ ~/tmp/hbase-0.90.6 HBASE_HOME

1. Preparing libraries in Eclipse

Go to /path/to/solr/dist and open apache-solr-3.6.0.war with your favorite archive manager. Go to /WEB-INF/lib/ and extract everything there to /path/to/solr/dist. This will allow us to include all the libraries we need in our Java application.

2. Creating an Eclipse project

Create a Java project with a class called MyCrawler.

Add a lib folder into the project, then add two sub folders: nutch and solr. Copy everything from NUTCH_HOME/lib to lib/nutch and everything from SOLR_HOME/dist to lib/solr.

Add a plugins folder into the project, add everything from NUTCH_HOME/plugins to it.

Add a urls folder and create a seed.txt file. Put seed urls here.

Create two folders somewhere on the file system: nutchconf and solrconf, and copy all the files from NUTCH_HOME/conf and SOLR_HOME/example/solr/conf to them respectively. 

In Eclipse project properties->Java Build Path, click on Add Class Folder and add the nutchconf and solrconf folders. Go to Order and Export, find the entries for nutchconf and solrconf, and move them to the top.

Change Lucene_core_3.4.0.jar to Lucene_core_3.6.0.jar.
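The folder layout from step 2 can be created from a shell before importing the project into Eclipse. The sketch below is a hedged example: PROJECT_DIR and the seed URL are placeholders, and the copy steps run only if NUTCH_HOME and SOLR_HOME are set and point at real directories.

```shell
#!/bin/sh
# Create the project skeleton from step 2 in a scratch directory.
# PROJECT_DIR and the seed URL are placeholders (assumptions).
PROJECT_DIR="${PROJECT_DIR:-$(mktemp -d)/MyCrawlerProject}"

mkdir -p "$PROJECT_DIR/lib/nutch" "$PROJECT_DIR/lib/solr" \
         "$PROJECT_DIR/plugins" "$PROJECT_DIR/urls"
echo "http://nutch.apache.org/" > "$PROJECT_DIR/urls/seed.txt"

# Copy jars and plugins only when the source trees are available.
if [ -n "${NUTCH_HOME:-}" ] && [ -d "$NUTCH_HOME/lib" ]; then
    cp -r "$NUTCH_HOME/lib/." "$PROJECT_DIR/lib/nutch/"
fi
if [ -n "${NUTCH_HOME:-}" ] && [ -d "$NUTCH_HOME/plugins" ]; then
    cp -r "$NUTCH_HOME/plugins/." "$PROJECT_DIR/plugins/"
fi
if [ -n "${SOLR_HOME:-}" ] && [ -d "$SOLR_HOME/dist" ]; then
    cp -r "$SOLR_HOME/dist/." "$PROJECT_DIR/lib/solr/"
fi

# List the resulting top-level layout.
find "$PROJECT_DIR" -maxdepth 2 -type d | sort
```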

3. Adding code into MyCrawler.
package websearch.crawler;

import java.util.StringTokenizer;

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Crawler;
import org.apache.nutch.indexer.solr.SolrIndexerJob;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class MyCrawler {

    /**
     * Crawls the seed URLs, indexes the results in Solr, then runs a sample query.
     *
     * @param args not used
     * @throws Exception if the Solr URL is malformed
     */
    public static void main(String[] args) throws Exception {
        // Run the crawl tool on the urls folder.
        String crawlArg = "urls -depth 3 -topN 5";
        try {
            ToolRunner.run(NutchConfiguration.create(), new Crawler(),
                    tokenize(crawlArg));
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }

        // Run the Solr index tool on the crawled pages.
        String indexArg = "http://localhost:8983/solr -reindex";
        try {
            ToolRunner.run(NutchConfiguration.create(), new SolrIndexerJob(),
                    tokenize(indexArg));
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }

        // Query the index for documents whose content field matches "mycontent".
        String url = "http://localhost:8983/solr";
        CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
        SolrQuery query = new SolrQuery();
        query.setQuery("content:mycontent");
        query.addSortField("content", SolrQuery.ORDER.asc);
        QueryResponse rsp;
        try {
            rsp = server.query(query);
        } catch (SolrServerException e) {
            e.printStackTrace();
            return;
        }

        // Display the results in the console. String concatenation prints
        // "null" for a missing field instead of throwing.
        SolrDocumentList docs = rsp.getResults();
        for (SolrDocument doc : docs) {
            System.out.println(doc.getFieldValue("title") + " Link: "
                    + doc.getFieldValue("url"));
        }
    }

    /**
     * Helper function to convert a string into an array of strings by
     * splitting it on whitespace.
     *
     * @param str string to be tokenized
     * @return an array of strings, one word per element
     */
    public static String[] tokenize(String str) {
        StringTokenizer tok = new StringTokenizer(str);
        String[] tokens = new String[tok.countTokens()];
        int i = 0;
        while (tok.hasMoreTokens()) {
            tokens[i] = tok.nextToken();
            i++;
        }
        return tokens;
    }
}



4. Run.
start hbase
start solr
run MyCrawler as Java application







Tuesday, September 25, 2012

Nutch Gora HBase Solr

1. download Nutch 2.0
2. download HBase version 0.90.6; Nutch 2.0 does not work with later versions.
 follow the installation instructions at http://hbase.apache.org/book/quickstart.html
note: if there is a proxy interface setting error, replace localhost/127.0.0.1 with the machine's IP address in /etc/hosts; if running in a virtual machine, use the IP address of the virtual host.
3. follow tutorials at http://wiki.apache.org/nutch/Nutch2Tutorial
nutch-site.xml


<property>
 <name>http.agent.name</name>
 <value>Spider</value>
</property>
<property>
 <name>http.robots.agents</name>
 <value>Spider,*</value>
</property>

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

<property>
 <name>http.content.limit</name>
<value>-1</value>
</property>

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

ivy/ivy.xml

<!-- Uncomment this to use HBase as Gora backend. -->
    
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

gora.properties


gora.datastore.default=org.apache.gora.hbase.store.HBaseStore


4. build with ant
    the build output is in /runtime
5. follow the instructions for Nutch 1.5 and Solr

bin/nutch crawl urls -depth 3 -topN 5 (this stores results in HBase but does not index them in Solr)
to index in Solr:
bin/nutch solrindex http://localhost:8983/solr -reindex




6. after running Nutch, all pages are stored in the HBase table webpage
   bin/hbase shell
   hbase(main):002:0> scan "webpage"




Friday, September 21, 2012

Eclipse installation on Ubuntu

1. download and extract.
2. Move it to /opt and set ownership and permissions
sudo mv eclipse /opt/
cd /opt
sudo chown -R root:root eclipse
sudo chmod -R +r eclipse

3. Create an eclipse executable in your path

sudo touch /usr/bin/eclipse
sudo chmod 755 /usr/bin/eclipse
sudo nano /usr/bin/eclipse
copy this into nano
#!/bin/sh
#export MOZILLA_FIVE_HOME="/usr/lib/mozilla/"
export ECLIPSE_HOME="/opt/eclipse"

"$ECLIPSE_HOME/eclipse" "$@"
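A launcher that forwards its arguments with "$@" rather than an unquoted $* keeps arguments that contain spaces intact, which matters for something like a workspace path. The sketch below only counts arguments to show the difference; the workspace name is made up.

```shell
#!/bin/sh
# Compare how $* and "$@" forward an argument that contains a space.
count_args() { echo "$#"; }

forward_star() { count_args $*; }     # unquoted: re-split on whitespace
forward_at()   { count_args "$@"; }   # quoted: arguments preserved

star_count="$(forward_star -data "my workspace")"
at_count="$(forward_at -data "my workspace")"
echo "\$* passes $star_count arguments, \"\$@\" passes $at_count"
```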
4. Create a gnome menu item
sudo nano /usr/share/applications/eclipse.desktop
copy this into nano
[Desktop Entry]
Encoding=UTF-8
Name=Eclipse
Comment=Eclipse IDE
Exec=eclipse
Icon=/opt/eclipse/icon.xpm
Terminal=false
Type=Application
Categories=GNOME;Application;Development;
StartupNotify=true
save and exit nano
5. Launch Eclipse for the first time
/opt/eclipse/eclipse -clean &

Thursday, September 20, 2012

Solr Query

The default query *:* matches all documents. A colon separates a field name from its value, e.g.

keywords:web

Using AND (&&), OR (||)

content:OWL OR title:"Semantic Web"
content:OWL && title:"Semantic Web"
content:OWL AND title:"Semantic Web"

Nested Queries

using _query_

content:"semantic" AND _query_:"alcategory:computer_internet" AND _query_:"title:web" AND _query_:"keywords:data"

source: http://searchhub.org/dev/2009/03/31/nested-queries-in-solr/
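Nested queries like the one above are tedious to type by hand. The following sketch assembles one from a main clause and any number of field:value clauses; nested_query is a hypothetical helper name, and the result still has to be URL-encoded before it is sent to Solr.

```shell
#!/bin/sh
# Build a nested Solr query: the first argument is the main clause,
# every further argument is wrapped in _query_:"...".
# nested_query is a hypothetical helper, not part of Solr.
nested_query() {
    q="$1"
    shift
    for clause in "$@"; do
        q="$q AND _query_:\"$clause\""
    done
    printf '%s\n' "$q"
}

nested="$(nested_query 'content:"semantic"' \
    'alcategory:computer_internet' 'title:web' 'keywords:data')"
echo "$nested"
```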

Wednesday, September 19, 2012

Crawling PDF using Apache Nutch

Nutch parses PDF documents with PDFBox through the parse-tika plugin, located in /plugins/parse-tika

in nutch-site.xml

add:
<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
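A small check can confirm that a nutch-site.xml has both settings needed for PDF crawling. This sketch runs against a trimmed stand-in file written to a scratch location, and it assumes the content-size property is named http.content.limit, as in the nutch-site.xml of the Nutch Gora HBase Solr post.

```shell
#!/bin/sh
# Check a nutch-site.xml for the two settings that matter for PDF crawling.
# The sample file is a trimmed stand-in written to a scratch location.
site_xml="$(mktemp)"
cat > "$site_xml" <<'EOF'
<configuration>
<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>
<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)</value>
</property>
</configuration>
EOF

check_pdf_settings() {
    grep -q '<name>http.content.limit</name>' "$1" || { echo "content limit not set"; return 1; }
    grep -q 'parse-.*tika' "$1" || { echo "parse-tika not enabled"; return 1; }
    echo "PDF settings present"
}

check_pdf_settings "$site_xml"
```

Point check_pdf_settings at the real conf/nutch-site.xml to use it outside the sketch.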


Test URL: http://www.master.netseven.it/files/262-Nutch.pdf

Note:
PDF files converted from Microsoft Word are parsed into unreadable symbols. Example URL: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf