Friday, November 23, 2012

Run VM instance on Eucalyptus, Xen

(on the front machine)

1.  check image

 $ euca-describe-images  

Note down the image ID: emi-xxxxxxx

2. create key pair

$ euca-add-keypair user01-keypair > user01-keypair.private  
$ chmod 0600 user01-keypair.private  
$ euca-describe-keypairs  

 KEYPAIR     user01-keypair 
03:d8:30:69:36:41:62:f8:6b:60:8d:17:19:c1:29:16:5b:6e:79:a9  

3. Authorize security groups

allow SSH access on port 22 from any address (0.0.0.0/0):

 euca-authorize -P tcp -p 22 -s 0.0.0.0/0 default  

4. Launch an instance

euca-run-instances emi-48AE3FD9 -k user01-keypair

If the instance launches successfully, output similar to the following is shown:

 RESERVATION     r-B5A143D7     292622667431     default  
 INSTANCE     i-28C442AB     emi-48AE3FD9     0.0.0.0     0.0.0.0     pending     user01-keypair     0          m1.small     2012-11-19T15:28:09.818Z     cluster01     eki-855C3923     eri-05BE397E          monitoring-disabled     0.0.0.00.0.0.0               instance-store            

5. Get instance status & log into the instance

euca-describe-instances
ssh -i user01-keypair.private root@<instance_ip>  
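The public IP can be pulled out of the euca-describe-instances output with awk. A minimal sketch on a hypothetical INSTANCE line (the field positions follow the sample output in step 4; the IP, state, and id values here are made up for illustration):

```shell
# Hypothetical INSTANCE record; whitespace-separated fields are:
# INSTANCE, instance id, image id, public IP, private IP, state, keypair
sample='INSTANCE i-28C442AB emi-48AE3FD9 10.1.62.200 192.168.1.5 running user01-keypair'

# Print the public IP (field 4) of running instances only.
ip=$(printf '%s\n' "$sample" | awk '$1 == "INSTANCE" && $6 == "running" { print $4 }')
echo "$ip"   # → 10.1.62.200
```

Against a live cloud the same filter would read `euca-describe-instances | awk '...'`, and the result feeds straight into `ssh -i user01-keypair.private root@$ip`.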

Troubleshooting:

1) error: RunInstancesType: Not enough resources (0 in cluster01 < 1): vm instances.

Run the following on the front machine, the NC, and the CC:

 euca-describe-availability-zones verbose  
 ip addr show  
 route -n  

Run the following on the cluster machine:

 euca-describe-services -E  

2) Errors in nc.log (still searching for a solution):

 [Mon Nov 19 15:36:52 2012][002893][EUCAINFO ] [i-CA574225] tuning root file system on disk 0 partition 1  
 [Mon Nov 19 15:36:52 2012][002893][EUCAERROR ] {2798909184} error: bad return code from cmd '//usr/lib/eucalyptus/euca_rootwrap /sbin/tune2fs /dev/mapper/euca-W9JINGX8VP5TYHDLPJPZM-i-CA574225-emi-48AE3$  
 [Mon Nov 19 15:36:52 2012][002893][EUCADEBUG ] /sbin/tune2fs: Bad magic number in super-block while trying to open /dev/mapper/euca-W9JINGX8VP5TYHDLPJPZM-i-CA574225-emi-48AE3FD9-6d4be1e1  
 Couldn't find valid filesystem superblock.  
 tune2fs 1.42 (29-Nov-2011)  
 [Mon Nov 19 15:36:52 2012][002893][EUCAINFO ] {2798909184} error: cannot tune file system on '/dev/mapper/euca-W9JINGX8VP5TYHDLPJPZM-i-CA574225-emi-48AE3FD9-6d4be1e1'  
 [Mon Nov 19 15:36:52 2012][002893][EUCAERROR ] [i-CA574225] error: failed to tune root file system: blobstore.c:3196 file access only supported for uncloned blockblobs  
 [Mon Nov 19 15:36:52 2012][002893][EUCAERROR ] [i-CA574225] error: failed to create artifact emi-48AE3FD9-6d4be1e1 (error=1, may retry) on try 1  
 [Mon Nov 19 15:36:52 2012][002893][EUCADEBUG ] {2798909184} detaching from loop device /dev/loop5  
 [Mon Nov 19 15:36:52 2012][002893][EUCADEBUG ] {2798909184} detaching from loop device /dev/loop4  
 [Mon Nov 19 15:36:52 2012][002893][EUCADEBUG ] [i-CA574225] error: failed to implement artifact 019|emi-48AE3FD9-6d4be1e1 on try 1  
 [Mon Nov 19 15:36:52 2012][002893][EUCAERROR ] [i-CA574225] error: failed to provision dependency emi-48AE3FD9-6d4be1e1 for artifact dsk-48AE3FD9-09c9bcae (error=1) on try 1  
 [Mon Nov 19 15:36:52 2012][002893][EUCADEBUG ] [i-CA574225] error: failed to implement artifact 024|dsk-48AE3FD9-09c9bcae on try 1  
 [Mon Nov 19 15:36:52 2012][002893][EUCAERROR ] [i-CA574225] error: failed to provision dependency dsk-48AE3FD9-09c9bcae for artifact i-CA574225 (error=1) on try 1  
 [Mon Nov 19 15:36:52 2012][002893][EUCADEBUG ] [i-CA574225] error: failed to implement artifact 013|i-CA574225 on try 1  
 [Mon Nov 19 15:36:52 2012][002893][EUCAERROR ] [i-CA574225] error: failed to implement backing for instance  
 [Mon Nov 19 15:36:53 2012][002893][EUCAERROR ] [i-CA574225] error: failed to prepare images for instance (error=1)  
 [Mon Nov 19 15:36:53 2012][002893][EUCADEBUG ] [i-CA574225] state change for instance: Staging -> Shutoff (Pending)  
 [Mon Nov 19 15:36:53 2012][002893][EUCAINFO ] [i-CA574225] cleaning up state for instance  

3) If nc.log logs nothing, or cc.log contains the following error:
 ERROR: DescribeResource() could not be invoked (check NC host, port, and credentials)

- Check whether the NC service is running; if not, run:

 sudo service eucalyptus-nc start  

- Check whether xend is running using:
 sudo xm list  

If it is not, run:

 sudo service xend start  

4) virsh reports QEMU as the running hypervisor instead of Xen:

$ virsh version
Compiled against library: libvir 0.9.8
Using library: libvir 0.9.8
Using API: QEMU 0.9.8
Running hypervisor: QEMU 1.0.0


Solution:
$ sudo -s
# echo "export VIRSH_DEFAULT_CONNECT_URI=xen:///" >> /etc/profile.d/libvirtd.sh
# chmod +x /etc/profile.d/libvirtd.sh
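The profile.d snippet does nothing more than export the default connection URI in every login shell, so virsh picks the Xen driver instead of the local QEMU one; for the current shell the effect is equivalent to:

```shell
# Exporting the URI by hand has the same effect for this shell only;
# the profile.d script makes it permanent for all users.
export VIRSH_DEFAULT_CONNECT_URI=xen:///
echo "$VIRSH_DEFAULT_CONNECT_URI"   # → xen:///
```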

# reboot 

Then modify /etc/xen/xend-config.sxp to enable the HTTP and Unix servers:
(xend-http-server yes)
(xend-unix-server yes)

(xend-unix-path /var/lib/xend/xend-socket)


sudo service xend restart
sudo service eucalyptus-nc restart
virsh version

2012-11-22 14:08:10.678+0000: 3051: info : libvirt version: 0.9.8
2012-11-22 14:08:10.678+0000: 3051: warning : xenHypervisorMakeCapabilities:2751 : Failed to get host power management capabilities
Compiled against library: libvir 0.9.8
Using library: libvir 0.9.8
Using API: Xen 0.9.8
Running hypervisor: Xen 4.1.0

5) Instance running with IP 0.0.0.0

check the instance in the console using

 sudo xm console <instance-id>  

Look for the network configuration and IP address.

6) RunInstancesType: Failed to lookup kernel image information unknown because of: Attempt to resolve a kerneId for BootableSet:machine=arn:aws:euca:eucalyptus:388002304024:image/emi-3F8B38A3/:ramdisk=false:kernel=false:isLinux=true during request RunInstancesType:2cc2fab3-6e36-45f9-986a-33dcb278399e:return=true:epoch=null:status=null

The kernel and ramdisk are not registered. Check the downloaded files for the corresponding hypervisor, then use three commands each to bundle, upload, and register both of them. Run the instance again.
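The per-image three-command sequence looks roughly like this sketch. The filenames and bucket name are placeholders, and the `--kernel true` / `--ramdisk true` flags mark the image type; this mirrors the euca2ools workflow used for the root image elsewhere in these notes, not output verified on this setup:

```shell
# Placeholders: vmlinuz-xxx / initrd-xxx are the downloaded kernel and
# ramdisk for your hypervisor; "mybucket" is any Walrus bucket name.
euca-bundle-image -i vmlinuz-xxx --kernel true
euca-upload-bundle -b mybucket -m /tmp/vmlinuz-xxx.manifest.xml
euca-register mybucket/vmlinuz-xxx.manifest.xml    # prints IMAGE eki-...

euca-bundle-image -i initrd-xxx --ramdisk true
euca-upload-bundle -b mybucket -m /tmp/initrd-xxx.manifest.xml
euca-register mybucket/initrd-xxx.manifest.xml     # prints IMAGE eri-...
```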

Monday, November 19, 2012

Local JARs for Maven Dependency


1. download jar into /mavenlocal

2. run

mvn install:install-file -DgroupId=classifier4j -DartifactId=Classifier4J -Dversion=0.6 -Dpackaging=jar -DcreateChecksum=true -Dfile=/home/xzhao/mavenlocal/Classifier4J-0.6.jar

3. Add dependency


<dependency>
    <groupId>classifier4j</groupId>
    <artifactId>Classifier4J</artifactId>
    <version>0.6</version>
</dependency>

The following method did not work well:


1. Download jar file into /lib

2. Use Maven to install to project repo

 mvn install:install-file -DlocalRepositoryPath=repo -DcreateChecksum=true -Dpackaging=jar -Dfile=[your-jar] -DgroupId=[...] -DartifactId=[...] -Dversion=[...]  

maven repository created in /lib/repo



3. Add repository in pom.xml

 <repository>  
   <id>repo</id>  
   <url>file://${project.basedir}/repo</url>  
 </repository>  

4. Add dependencies


sources:
  • http://stackoverflow.com/questions/364114/can-i-add-jars-to-maven-2-build-classpath-without-installing-them
  • http://blog.dub.podval.org/2010/01/maven-in-project-repository.html

Friday, November 16, 2012

Create Ubuntu Xen image for Eucalyptus

(perform steps 1-6 on the node controller; the remaining steps on the client machine)

1. Create a folder and download the ISO file.

sudo mkdir ubuntu-xen-manual
cd ubuntu-xen-manual
wget ***.iso

2. Create a 4GB virtual disk

sudo dd if=/dev/zero of=ubuntu-12.04D.img bs=1M count=4096
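The dd parameters above give a 4 GiB image; the byte arithmetic (handy later when working with parted byte offsets in step 6) is:

```shell
# bs=1M count=4096 → 4096 MiB total; parted reports the last byte of
# such an image as size-1 (4294967295B in step 6).
bs=$((1024 * 1024))   # 1 MiB in bytes
count=4096
echo $((bs * count))  # → 4294967296
```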
 

3. Create a Xen configuration file (xen.cfg) with the following contents:

 name = "ubuntubox"  
 #make sure kernel is in right place  
 kernel = "/usr/lib/xen-default/boot/hvmloader"  
 memory = 1024  
 builder = "hvm"  
 #make sure device_model is in right place  
 device_model = "/usr/lib/xen-default/bin/qemu-dm"  
 boot = "d"  
 disk = ['file:~/ubuntu-xen-manual/***.iso,hdc:cdrom,r',  
 'file:~/ubuntu-xen-manual/ubuntu-12.04D.img,hda,w']  
 vif = ['']  
 #dhcp="on"  
 vnc = 1  
 vncdisplay = 7  
 pae = 1  
 

4. Start domU

sudo xm create xen.cfg  

5. Connect with a VNC viewer (if desktop version)

sudo apt-get install xvnc4viewer
xvncviewer localhost:7

6. Find out the starting block and the block size of the root file system.

sudo parted ubuntu-12.04D.img
(parted) U
Unit? [compact]? b
(parted) p
 
You'll see an "unrecognized disk label" message because it is a new drive.
 
(parted) mklabel msdos
(parted) print free


 Number  Start   End          Size         Type  File system  Flags
        16384B  4294967295B  4294950912B        Free Space
(parted) q
 
sudo dd if=ubuntu-12.04D.img of=rootfs.img bs=1 skip=16384 count=4294950912

(The extraction takes quite a while; for a quick test, create a smaller image, e.g. 512 MB.)
 
the root image is created as rootfs.img.
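The skip and count values for dd come straight from the parted byte offsets: skip is the Start of the region and count is its inclusive length. A quick sanity check of the numbers used above:

```shell
# Byte offsets as printed by parted (unit = b)
start=16384
end=4294967295
count=$((end - start + 1))   # inclusive range length
echo "$count"                # → 4294950912, matching the dd count above
```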
 

7. Bundle, upload and register root image to Eucalyptus

scp rootfs.img to client machine
open a terminal on client machine:

sudo scp server01@10.1.62.172:~/ubuntu-xen-manual/rootfs.img ~/

cd ~/.euca
source eucarc

euca-bundle-image -i ~/rootfs.img (execute in ~/.euca; otherwise an "EC2_CERT not found" error occurs)

 Checking image  
 Compressing image  
 Encrypting image  
 Splitting image...  
 Part: rootfs.img.part.00  
 Generating manifest /tmp/rootfs.img.manifest.xml  

euca-upload-bundle -b ubuntu -m /tmp/rootfs.img.manifest.xml

 Checking bucket: ubuntu  
 Creating bucket: ubuntu  
 Uploading manifest file  
 Uploading part: rootfs.img.part.00  
 Uploaded image as ubuntu/rootfs.img.manifest.xml  

 euca-register ubuntu/rootfs.img.manifest.xml

 IMAGE     emi-48AE3FD9  

euca-describe-images

 IMAGE     emi-48AE3FD9     ubuntu/rootfs.img.manifest.xml 
292622667431     available public          i386     machine 
eki-855C3923     eri-05BE397E          instance-store  

Wednesday, November 7, 2012

Configure SIREn, Solr on Jetty

(Default settings: SIREn with Lucene 3.5, Solr 3.6 with default Jetty)

1. Create a /lib folder under SOLR_HOME/example/solr and add the following jars from the SIREn target folders:


siren-core-0.2.3-SNAPSHOT.jar
siren-qparser-0.2.3-SNAPSHOT.jar
siren-solr-0.2.3-SNAPSHOT.jar

2. Modify solrconfig.xml, adding the following:


 <!-- Example of Registration of the siren query parser. -->  
  <queryParser name="siren" class="org.sindice.siren.solr.SirenQParserPlugin"/>  
  <requestHandler name="siren" class="solr.StandardRequestHandler">  
   <!-- default values for query parameters -->  
    <lst name="defaults">  
     <str name="defType">siren</str>  
     <str name="echoParams">explicit</str>  
                 <!-- Disable field query in keyword parser -->  
     <str name="disableField">true</str>  
     <str name="qf">  
      ntriple^1.0 url^1.2  
     </str>  
     <str name="nqf">  
      ntriple^1.0  
     </str>  
     <!-- the NTriple query multi-field operator:  
       - disjunction: the query should match in at least one of the fields  
       - scattered: each Ntriple patterns should match in at least on of the fields  
     -->   
     <str name="nqfo">scattered</str>  
     <str name="tqf">  
      tabular^1.0  
     </str>  
     <!-- the Tabular query multi-field operator:  
       - disjunction: the query should match in at least one of the fields  
       - scattered: each tabular patterns should match in at least on of the fields  
     -->  
     <str name="tqfo">scattered</str>  
     <str name="fl">  
      id  
     </str>  
    </lst>  
  </requestHandler>  

3. Modify schema.xml, adding the following; rename the fields url and id first if they already exist in the file.


 <!-- The ID (URL) of the document   
         Use the 'string' field type (no tokenisation)  
      -->  
        <field name="id" type="string" indexed="true" stored="true" required="false"/>  
       <!-- The URL of the document   
         Use the 'text' field type in order to be tokenised  
      -->  
        <field name="url" type="uri" indexed="true" stored="true" required="true"/>  
 <!-- n-triple indexing scheme -->  
        <field name="ntriple" type="ntriple" indexed="true" stored="true" multiValued="false"/>  
     <!-- tabular indexing scheme -->  
     <field name="tabular" type="tabular" indexed="true" stored="false" multiValued="false"/>  

 <!-- A uri field that uses WhitespaceTokenizer and WordDelimiterFilter to   
      split URIs into multiple components. Stopwords are customized by   
      external files.  
      omitNorms is true since it is a short field, and it does not make   
      really sense on URI.  
      Does not use the ASCIIFoldingExpansionFilter since URIs should not  
      contain accented characters.  
   -->  
   <fieldType name="uri" class="solr.TextField" omitNorms="true" positionIncrementGap="100">  
    <analyzer type="index">  
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>  
     <!-- Splits words into subwords based on delimiters  
        - split subwords based on case change  
        - preserveOriginal="1" in order to preserve the original word.  
        Removed split based on numerics to fix SND-355 and SND-1283   
     -->  
     <filter class="solr.WordDelimiterFilterFactory"   
         generateWordParts="1"   
         generateNumberParts="1"   
         catenateWords="0"   
         catenateNumbers="0"   
         catenateAll="0"   
         splitOnCaseChange="1"  
         splitOnNumerics="0"  
         preserveOriginal="1"/>  
     <!-- Filters out those tokens *not* having length min through max   
        inclusive. -->  
     <filter class="solr.LengthFilterFactory" min="2" max="256"/>  
     <!-- Change to lowercase text -->  
     <filter class="solr.LowerCaseFilterFactory"/>  
     <!-- Case insensitive stop word removal.  
      add enablePositionIncrements=true in both the index and query  
      analyzers to leave a 'gap' for more accurate phrase queries.  
     -->  
     <filter class="solr.StopFilterFactory"  
         ignoreCase="true"  
         words="stopwords.txt"  
         enablePositionIncrements="true"  
         />  
    </analyzer>  
    <analyzer type="query">  
     <!-- whitespace tokenizer to not tokenize URI -->  
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>  
     <!-- Filters out those tokens *not* having length min through max   
        inclusive. -->  
     <filter class="solr.LengthFilterFactory" min="2" max="256"/>  
     <filter class="solr.LowerCaseFilterFactory"/>  
     <filter class="solr.StopFilterFactory"  
         ignoreCase="true"  
         words="stopwords.txt"  
         enablePositionIncrements="true"  
         />  
     <!-- Replace Qnames by their name spaces in URIs. -->  
     <filter class="org.sindice.siren.solr.analysis.QNamesFilterFactory"   
         qnames="qnames.txt"/>  
    </analyzer>  
   </fieldType>  
 <!--  
            The SIREn field type:  
                The top-level analyzers must be defined in the top-level analyzer   
    configuration file (ntriple-analyzers.xml) and the datatype analyzers in   
    the datatype analyzer configuration file (ntriples-datatypes.xml).   
                Field norms are not useful for SIREn fields. Set omitNorms to true reduces  
                memory consumption, and improve ranking.  
    omitTermFreqAndPositions *must* be set to false.  
           -->  
   <fieldType name="ntriple" class="org.sindice.siren.solr.schema.SirenField"  
         omitNorms="true"   
         omitTermFreqAndPositions="false"  
         analyzerConfig="tuple-analyzers.xml"  
         datatypeConfig="tuple-datatypes.xml"/>  
   <fieldType name="tabular" class="org.sindice.siren.solr.schema.SirenField"  
         omitNorms="true"   
         omitTermFreqAndPositions="false"  
         analyzerConfig="tuple-analyzers.xml"  
         datatypeConfig="tuple-datatypes.xml"/>  

 <similarity class="org.sindice.siren.similarity.SirenSimilarity"/>  

4. copy the following files from SIREN_HOME/siren_solr/example/solr/config to SOLR_HOME/example/solr/config

tuple-analyzers.xml
tuple-datatypes.xml
qnames.txt

5. Restart the default Jetty bundled with Solr: java -jar start.jar


6. test with sample code in SIREN_HOME/siren_solr/example/

The examples are indexed successfully but the queries return no result.

P.S. SIREn doesn't support SPARQL.

sources:

  • https://github.com/rdelbru/SIREn/blob/master/siren-solr/example/INSTALL.txt



Install SIREn on Ubuntu 12.04

1. Check the JDK and Maven installation.

$ sudo apt-get install maven  (Maven 3 will be installed)

2. Run Maven in the SIREn directory

$ mvn package

3. check jars 

The jar files are located under /target in each folder.

P.S. If the following error occurs:

[ERROR] Failed to execute goal on project siren-core: Could not resolve dependencies for project org.sindice.siren:siren-core:jar:0.2.3-SNAPSHOT: Failure to find com.google.code.caliper:caliper:jar:1.0-SNAPSHOT in https://oss.sonatype.org/content/groups/public/ was cached in the local repository, resolution will not be reattempted until the update interval of oss-sonatype has elapsed or updates are forced -> [Help 1]

Modify pom.xml under siren-core

Change
 <dependency>  
    <groupId>com.google.code.caliper</groupId>  
    <artifactId>caliper</artifactId>  
    <version>1.0-SNAPSHOT</version>  
    <scope>test</scope>  
   </dependency>  

To
 <dependency>  
 <groupId>com.google.caliper</groupId>  
 <artifactId>caliper</artifactId>  
 <version>0.5-rc1</version>  
 <scope>test</scope>  
 </dependency>  

Sources:

  • https://github.com/rdelbru/SIREn/wiki/Getting-Started



Tuesday, October 30, 2012

Ubuntu Server 12.04 LTS, Eucalyptus and Xen

In this installation, two Dell machines (dual-core, 4 GB RAM, 250 GB hard disk) and one IBM x3650 machine are used.

1. Install Eucalyptus cloud and cluster controller (Dell)

1.1. Ubuntu Server 12.04 LTS 
  • default installation with the OpenSSH server
  • make sure the network is configured correctly
1.2 Install postfix, which Eucalyptus uses to send confirmation emails to newly registered users.
1.3 Install Eucalyptus
1.3.1 download and add pub key
  • sudo apt-key add c1240596-eucalyptus-release-key.pub
 
1.3.2 add deb
Create a file in /etc/apt/sources.list.d called eucalyptus.list with the following content:


  • deb http://downloads.eucalyptus.com/software/eucalyptus/3.1/ubuntu precise main

On all machines that will run either Eucalyptus or Euca2ools, create a file in /etc/apt/sources.list.d called euca2ools.list with the following content:
  • deb http://downloads.eucalyptus.com/software/euca2ools/2.1/ubuntu precise main
1.3.3 install
  • sudo apt-get update  
  • sudo apt-get install eucalyptus-cloud eucalyptus-cc eucalyptus-walrus
1.3.4 start
  • /usr/sbin/euca_conf --initialize (for first time)
    sudo service eucalyptus-cloud start
    sudo service eucalyptus-cc start 

    1.3.5 verify
    Check in a Web browser: https://<ip>:8443
First-time login (admin/admin) prompts you to change the password and set an email address.


2. Install Eucalyptus node controller (IBM)

2.1 Repeat 1.1, 1.3.1, 1.3.2
2.2 install Xen
  • sudo apt-get install iproute  iptables module-init-tools python2.5 python2.6
  • sudo apt-get  install xen-utils
2.3 Modify GRUB to boot default on Xen
sudo sed -i 's/GRUB_DEFAULT=.*\+/GRUB_DEFAULT="Xen 4.1-amd64"/' /etc/default/grub
sudo update-grub
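To see what the substitution does without touching the real file, a simplified version of the same sed expression can be run on a sample line (the GRUB_DEFAULT=0 value here is a made-up example):

```shell
# Simplified regex (.* instead of .*\+); same replacement text as above.
line='GRUB_DEFAULT=0'
printf '%s\n' "$line" |
  sed 's/GRUB_DEFAULT=.*/GRUB_DEFAULT="Xen 4.1-amd64"/'
# → GRUB_DEFAULT="Xen 4.1-amd64"
```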
 
Set the default toolstack to xm (aka xend):
sudo sed -i 's/TOOLSTACK=.*\+/TOOLSTACK="xm"/' /etc/default/xen
Now reboot:
sudo reboot
And then verify that the installation has succeeded: 
sudo xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 945 1 r----- 11.3

2.4 Network configuration
Check whether eth0 or eth1 is in use; the following assumes eth0 with DHCP.

sudo apt-get install bridge-utils
 
Edit /etc/network/interfaces, and add this:  

auto xenbr0
iface xenbr0 inet dhcp
    bridge_ports eth0
 
Restart networking to enable xenbr0 bridge: 
 
sudo service networking restart


 
2.5 install Eucalyptus node controller
  • sudo apt-get install eucalyptus-nc
    sudo service eucalyptus-nc start

3. Register

Register the CC (on the CLC machine)

sudo /usr/sbin/euca_conf --register-cluster --partition cluster01  --host  10.1.62.178 --component  cc-10.1.62.178
(this may require root access; use sudo passwd to set a root password)

Register the NCs (on the CC machine)


sudo /usr/sbin/euca_conf --register-nodes "<node0_IP_address> ... <nodeN_IP_address>"

Register Walrus (on the CLC machine)

sudo /usr/sbin/euca_conf --register-walrus --partition walrus --host <walrus_IP_address> --component <walrus-hostname>


4. Install Eucalyptus client machine (Dell)

4.1 install Ubuntu Desktop 12.04 LTS
4.2 repeat 1.3.1 and 1.3.2
4.3 install euca2ools
  • sudo apt-get install euca2ools
Apply for a new user account via the web console at port 8443.

Get admin credentials
log in as eucalyptus/admin at port 8443

download new credentials from admin@eucalyptus
unzip to ~/.euca
mkdir ~/.euca
cd ~/.euca
unzip <filepath>/euca2-<user>-x509.zip
source eucarc
(remember to source the admin credentials)

4.4 Add image from eustore
eustore-describe-images
eustore-install-image -b testbucket -i xxxxxxxx -k xen
(this will take a while)
the results look as follows:

Downloading Image :  CentOS 5 1.3GB root, Hypervisor-Specific Kernels
0-----1-----2-----3-----4-----5-----6-----7-----8-----9-----10
##############################################################

Checking image bundle
Unbundling image
going to look for kernel dir : xen-kernel
Bundling/uploading kernel
Checking image
Compressing image
Encrypting image
Splitting image...
Part: vmlinuz-2.6.27.21-0.1-xen.part.00
Generating manifest /tmp/RXAdDH/vmlinuz-2.6.27.21-0.1-xen.manifest.xml
Checking bucket: testbucket
Creating bucket: testbucket
Uploading manifest file
Uploading part: vmlinuz-2.6.27.21-0.1-xen.part.00
Uploaded image as testbucket/vmlinuz-2.6.27.21-0.1-xen.manifest.xml
testbucket/vmlinuz-2.6.27.21-0.1-xen.manifest.xml
eki-855C3923
Bundling/uploading ramdisk
Checking image
Compressing image
Encrypting image
Splitting image...
Part: initrd-2.6.27.21-0.1-xen.part.00
Generating manifest /tmp/RXAdDH/initrd-2.6.27.21-0.1-xen.manifest.xml
Checking bucket: testbucket
Uploading manifest file
Uploading part: initrd-2.6.27.21-0.1-xen.part.00
Uploaded image as testbucket/initrd-2.6.27.21-0.1-xen.manifest.xml
testbucket/initrd-2.6.27.21-0.1-xen.manifest.xml
eri-05BE397E
Bundling/uploading image
Checking image
Compressing image
Encrypting image
Splitting image...
Part: euca-centos-5.8-2012.07.05-x86_64.part.00
Part: euca-centos-5.8-2012.07.05-x86_64.part.01
Part: euca-centos-5.8-2012.07.05-x86_64.part.02
Part: euca-centos-5.8-2012.07.05-x86_64.part.03
Part: euca-centos-5.8-2012.07.05-x86_64.part.04
Part: euca-centos-5.8-2012.07.05-x86_64.part.05
Part: euca-centos-5.8-2012.07.05-x86_64.part.06
Part: euca-centos-5.8-2012.07.05-x86_64.part.07
Part: euca-centos-5.8-2012.07.05-x86_64.part.08
Part: euca-centos-5.8-2012.07.05-x86_64.part.09
Part: euca-centos-5.8-2012.07.05-x86_64.part.10
Part: euca-centos-5.8-2012.07.05-x86_64.part.11
Part: euca-centos-5.8-2012.07.05-x86_64.part.12
Part: euca-centos-5.8-2012.07.05-x86_64.part.13
Part: euca-centos-5.8-2012.07.05-x86_64.part.14
Part: euca-centos-5.8-2012.07.05-x86_64.part.15
Part: euca-centos-5.8-2012.07.05-x86_64.part.16
Part: euca-centos-5.8-2012.07.05-x86_64.part.17
Part: euca-centos-5.8-2012.07.05-x86_64.part.18
Part: euca-centos-5.8-2012.07.05-x86_64.part.19
Part: euca-centos-5.8-2012.07.05-x86_64.part.20
Part: euca-centos-5.8-2012.07.05-x86_64.part.21
Part: euca-centos-5.8-2012.07.05-x86_64.part.22
Part: euca-centos-5.8-2012.07.05-x86_64.part.23
Part: euca-centos-5.8-2012.07.05-x86_64.part.24
Part: euca-centos-5.8-2012.07.05-x86_64.part.25
Part: euca-centos-5.8-2012.07.05-x86_64.part.26
Part: euca-centos-5.8-2012.07.05-x86_64.part.27
Part: euca-centos-5.8-2012.07.05-x86_64.part.28
Part: euca-centos-5.8-2012.07.05-x86_64.part.29
Part: euca-centos-5.8-2012.07.05-x86_64.part.30
Part: euca-centos-5.8-2012.07.05-x86_64.part.31
Part: euca-centos-5.8-2012.07.05-x86_64.part.32
Part: euca-centos-5.8-2012.07.05-x86_64.part.33
Part: euca-centos-5.8-2012.07.05-x86_64.part.34
Part: euca-centos-5.8-2012.07.05-x86_64.part.35
Generating manifest /tmp/RXAdDH/euca-centos-5.8-2012.07.05-x86_64.manifest.xml
Checking bucket: testbucket
Uploading manifest file
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.00
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.01
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.02
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.03
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.04
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.05
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.06
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.07
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.08
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.09
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.10
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.11
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.12
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.13
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.14
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.15
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.16
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.17
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.18
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.19
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.20
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.21
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.22
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.23
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.24
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.25
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.26
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.27
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.28
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.29
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.30
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.31
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.32
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.33
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.34
Uploading part: euca-centos-5.8-2012.07.05-x86_64.part.35
Uploaded image as testbucket/euca-centos-5.8-2012.07.05-x86_64.manifest.xml
testbucket/euca-centos-5.8-2012.07.05-x86_64.manifest.xml
Installed image: emi-97B83704


Run euca-describe-images to verify emi-97B83704; results as follows:
IMAGE    eki-855C3923    testbucket/vmlinuz-2.6.27.21-0.1-xen.manifest.xml    292622667431    available    public        x86_64    kernel                instance-store
IMAGE    emi-97B83704    testbucket/euca-centos-5.8-2012.07.05-x86_64.manifest.xml    292622667431    available    public        x86_64    machine    eki-855C3923    eri-05BE397E        instance-store
IMAGE    eri-05BE397E    testbucket/initrd-2.6.27.21-0.1-xen.manifest.xml    292622667431    available    public        x86_64    ramdisk                instance-store


Images can also be seen in the web console (port 8443) under Resource management / Images.

5. Generate admin credentials


/usr/sbin/euca_conf --get-credentials admin.zip

unzip admin.zip
sudo -s
source eucarc


sources:
1. Xen setting: https://help.ubuntu.com/community/XenProposed
 







Thursday, October 25, 2012

Working on IBM x3650

Network setting in Ubuntu server 12.10

sudo nano /etc/network/interfaces

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp
netmask 255.255.255.0
broadcast 10.1.62.255


auto eth1
iface eth1 inet dhcp
netmask 255.255.255.0
broadcast 10.1.62.255

sudo service networking restart
sudo ifup eth0
sudo ifup eth1

ifconfig -a
shows the resulting network configuration:
eth0 with ip 10.1.62.xxx
eth1 with ip 192.168.2.99

To test connectivity: ping google.com



Monday, October 1, 2012

Building a Java Application with Nutch 2.0 and Solr 3.6

Apache Nutch 2.0 source @ ~/tmp/apache-nutch-2.0-src (NUTCH_HOME)
Apache Solr 3.6 binary @ ~/tmp/apache-solr-3.6.0 (SOLR_HOME)
Apache HBase @ ~/tmp/hbase-0.90.6 (HBASE_HOME)

1. Preparing libraries in Eclipse

Go to /path/to/solr/dist and open apache-solr-3.6.0.war with your favorite archive manager. Go to /WEB-INF/lib/ and extract everything there to /path/to/solr/dist. This will allow us to include all the libraries we need in our Java application.

2. Creating an Eclipse project

Create a Java project with a class called MyCrawler.

Add a lib folder into the project, then add two sub folders: nutch and solr. Copy everything from NUTCH_HOME/lib to lib/nutch and everything from SOLR_HOME/dist to lib/solr.

Add a plugins folder into the project, add everything from NUTCH_HOME/plugins to it.

Add a urls folder and create a seed.txt file. Put seed urls here.

Create two folders somewhere on the file system: nutchconf and solrconf, and copy all the files from NUTCH_HOME/conf and SOLR_HOME/example/solr/conf to them respectively. 

In Eclipse project properties -> Java Build Path, click Add Class Folder and add the nutchconf and solrconf folders. Go to Order and Export, find the entries for nutchconf and solrconf, and move them to the top.

Change lucene-core-3.4.0.jar to lucene-core-3.6.0.jar.

3. Adding code into MyCrawler.
 package websearch.crawler;  
 import java.util.StringTokenizer;  
 import org.apache.hadoop.util.ToolRunner;  
 import org.apache.nutch.indexer.solr.SolrIndexerJob;  
 import org.apache.nutch.crawl.Crawler;  
 import org.apache.nutch.util.NutchConfiguration;  
 import org.apache.solr.client.solrj.SolrQuery;  
 import org.apache.solr.client.solrj.SolrServerException;  
 import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;  
 import org.apache.solr.client.solrj.response.QueryResponse;  
 import org.apache.solr.common.SolrDocumentList;  
 public class MyCrawler {  
      /**  
       * @param args  
       * @throws Exception   
       */  
      public static void main(String[] args) throws Exception {  
           String crawlArg = "urls -depth 3 -topN 5";  
     // Run Crawl tool  
     try {  
         ToolRunner.run(NutchConfiguration.create(), new Crawler(),  
                 tokenize(crawlArg));  
     } catch (Exception e) {  
         e.printStackTrace();  
         return;  
     }  
     String indexArg = "http://localhost:8983/solr -reindex";  
     //Run Solr index tool
     try {  
       ToolRunner.run(NutchConfiguration.create(), new SolrIndexerJob(),  
               tokenize(indexArg));  
     } catch (Exception e) {  
       e.printStackTrace();  
       return;  
     }  
     // Let's query for something!  
     String url = "http://localhost:8983/solr";  
     CommonsHttpSolrServer server = new CommonsHttpSolrServer( url );  
     SolrQuery query = new SolrQuery();  
     query.setQuery("content:mycontent"); // Searching mycontent in query  
     query.addSortField("content", SolrQuery.ORDER.asc);  
     QueryResponse rsp;  
     try {  
         rsp = server.query(query);  
     } catch (SolrServerException e) {  
         // TODO Auto-generated catch block  
         e.printStackTrace();  
         return;  
     }  
     // Display the results in the console  
     SolrDocumentList docs = rsp.getResults();  
     for (int i = 0; i < docs.size(); i++) {  
         System.out.println(docs.get(i).get("title").toString() + " Link: "  
                 + docs.get(i).get("url").toString());  
     }    
 }  
      /**  
       * Helper function to convert a string into an array of strings by  
       * separating them using whitespace.  
       *   
       * @param str  
       *      string to be tokenized  
        * @return an array of strings containing one word each  
       */  
      public static String[] tokenize(String str) {  
           StringTokenizer tok = new StringTokenizer(str);  
           String tokens[] = new String[tok.countTokens()];  
           int i = 0;  
           while (tok.hasMoreTokens()) {  
                tokens[i] = tok.nextToken();  
                i++;  
           }  
           return tokens;  
      }  
 }  
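The tokenize() helper simply splits the argument string on whitespace, the same way shell word splitting would; a quick illustration with the crawl argument string used above:

```shell
crawlArg="urls -depth 3 -topN 5"
# Unquoted expansion splits on whitespace, like tokenize() does.
set -- $crawlArg
echo "$#"   # → 5  (number of tokens)
echo "$2"   # → -depth
```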



4. Run.
start hbase
start solr
run MyCrawler as Java application



Tuesday, September 25, 2012

Nutch Gora HBase Solr

1. download Nutch 2.0
2. download HBase version 0.90.6; Nutch 2.0 doesn't work with higher versions.
 follow the installation instructions at http://hbase.apache.org/book/quickstart.html
note: if there is a proxy/interface setting error, change localhost/127.0.0.1 to the IP address in /etc/hosts; if running in a virtual machine, change the IP address of the virtual host.
3. follow tutorials at http://wiki.apache.org/nutch/Nutch2Tutorial
nutch-site.xml


<property>
 <name>http.agent.name</name>
 <value>Spider</value>
</property>
<property>
 <name>http.robots.agents</name>
 <value>Spider,*</value>
</property>

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

<property>
 <name>http.content.limit</name>
<value>-1</value>
</property>

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

ivy/ivy.xml

<!-- Uncomment this to use HBase as Gora backend. -->
    
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

gora.properties


gora.datastore.default=org.apache.gora.hbase.store.HBaseStore


4. ant build
    build files are in /runtime
5. follow instructions of nutch 1.5 and solr

bin/nutch crawl urls  -depth 3 -topN 5 (this stores results in hbase but does not index in Solr)
to index in Solr bin/nutch solrindex http://localhost:8983/solr -reindex 




6. after running nutch, all files are stored in hbase /webpage
   bin/hbase shell
   hbase(main):002:0> scan "webpage"




Friday, September 21, 2012

Eclipse installation on Ubuntu

1. download and extract.
2.
sudo mv eclipse /opt/
cd /opt/eclipse
sudo chown -R root:root eclipse
sudo chmod -R +r eclipse

3. Create an eclipse executable in your path

sudo touch /usr/bin/eclipse
sudo chmod 755 /usr/bin/eclipse
sudo nano /usr/bin/eclipse
copy this into nano
#!/bin/sh
#export MOZILLA_FIVE_HOME="/usr/lib/mozilla/"
export ECLIPSE_HOME="/opt/eclipse"

$ECLIPSE_HOME/eclipse $*
4. Create a gnome menu item
sudo nano /usr/share/applications/eclipse.desktop
copy this into nano
[Desktop Entry]
Encoding=UTF-8
Name=Eclipse
Comment=Eclipse IDE
Exec=eclipse
Icon=/opt/eclipse/icon.xpm
Terminal=false
Type=Application
Categories=GNOME;Application;Development;
StartupNotify=true
save and exit nano
6) Launch Eclipse for the first time
/opt/eclipse/eclipse -clean &

Thursday, September 20, 2012

Solr Query

default for all *:*, : used to separate parameters and values,e.g.

keywords:web

Using AND (&&), OR

content: OWL OR title: Semantic Web
content: OWL && title: Semantic Web
content: OWL AND title: Semantic Web

Nested Queries

using _query_

content:"semantic" AND _query_:"alcategory:computer_internet" AND _query_:"title:web" AND _query_:"keywords:data"

source: http://searchhub.org/dev/2009/03/31/nested-queries-in-solr/

Wednesday, September 19, 2012

Crawling PDF using Apache Nutch

Nutch uses pdfbox plugin for crawling pdf documents. The plugin is located in /plugins/parse-tika

in nutch-site.xml

add:
<property>
 <name>http.content.size</name>
<value>-1</value>
</property>

<property> 

<name>plugin.includes</name> 

<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|que 
ry-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlno 
rmalizer-(pass|regex|basic)</value> 

</property> 


testing using url: http://www.master.netseven.it/files/262-Nutch.pdf

note:
parsed with unreadable symbols for pdf files converted from Microsoft Word. example URL: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf

Delete Documents from Solr index


Delete all documents from Solr index, enter the following in browser:

http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/update?stream.body=<commit/>

Friday, September 14, 2012

Nutch, Solr, UIMA integration




Fields may not be added. Then do the following changes:

<updateRequestProcessorChain name="uima" default="true">


<lst name="analyzeFields">
          <bool name="merge">false</bool>
          <arr name="fields">
            <str>content</str>
          </arr>
        </lst>
Check this link for UIMA AlchemyAPI annotator. http://uima.apache.org/d/uima-addons-current/AlchemyAPIAnnotator/AlchemyAPIAnnotatorUserGuide.html

Install Apache Nutch 1.5 and Solr 3.6.0 on Ubuntu 12



1. install jdk
$sudo apt-get install openjdk-7-jdk

2. Download and unpack Solr

sudo mkdir ~/tmp/solr
cd ~/tmp/solr
wget http://mirror.lividpenguin.com/pub/apache/lucene/solr/3.6.0/apache-solr-3.6.0.tgz
tar -xzvf apache-solr-3.6.0.tgz
*default jetty in solr, try to run java -jar start.jar* shutdown Ctrl-C

check http://localhost:8983/solr

3. Download and unpack Nutch

sudo mkdir ~/tmp/nutch
cd ~/tmp/nutch
wget  http://mirror.rmg.io/apache/nutch/1.5/apache-nutch-1.5-bin.tar.gz
tar -xzvf apache-nutch-1.5-bin.tar.gz

4. configure Nutch
chmod +x bin/nutch
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
add in conf/nutch-site.xml

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>


mkdir -p urls
cd urls
touch seed.txt
nano seed.txt
add urls for crawling, for example

http://nutch.apache.org/


in conf/regex-urlfilter.txt and replace
# accept anything else
+.

with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:
 +^http://([a-z0-9]*\.)*nutch.apache.org/

5. configure Solr
 ~/tmp/solr/apache-solr-3.6.0/example/solr/conf
schema.xml add the following

<fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>


<field name="digest" type="text" stored="true" indexed="true"/>
<field name="boost" type="text" stored="true" indexed="true"/>
<field name="segment" type="text" stored="true" indexed="true"/>
<field name="host" type="text" stored="true" indexed="true"/>
<field name="site" type="text" stored="true" indexed="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="tstamp" type="text" stored="true" indexed="false"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="anchor" type="text" stored="true" indexed="false" multiValued="true"/>

change <uniqueKey>id</uniqueKey> to
<uniqueKey>url</uniqueKey>

in solrconfig.xml add
<requestHandler name="/nutch" class="solr.SearchHandler" >
    <lst name="defaults">
       <str name="defType">dismax</str>
       <str name="echoParams">explicit</str>
       <float name="tie">0.01</float>
       <str name="qf">
         content^0.5 anchor^1.0 title^1.2
       </str>
       <str name="pf">
         content^0.5 anchor^1.5 title^1.2 site^1.5
       </str>
       <str name="fl">
       url
       </str>
       <int name="ps">100</int>
       <bool name="hl">true</bool>
       <str name="q.alt">*:*</str>
<str name="hl.fl">title url content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
</lst>
</requestHandler>
   
6. run Nutch crawler and index in Solr (make sure Solr has started)

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
 
check indexed files @ http://localhost:8983/solr
 
ignore error in Nutch running console:
SolrIndexer: starting at 2012-09-14 10:37:49
Indexing 11 documents
SolrIndexer: finished at 2012-09-14 10:38:36, elapsed: 00:00:46
SolrDeleteDuplicates: starting at 2012-09-14 10:38:36
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)