Intro

This tutorial shows you how to create a simple Katta index with Hadoop. Since a Katta index is basically just a folder containing Lucene index subfolders, creating one can be as easy as copying a set of Lucene indexes into a folder.

Creating a Lucene index is a relatively straightforward process. If you have never used Lucene before, please read an introduction to Lucene first. The book Lucene in Action by Erik Hatcher and Otis Gospodnetić is also a great place to start.
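
If you just want a feel for the Lucene side of things, here is a minimal sketch of creating a local Lucene index. It assumes a Lucene 2.9-era API; the exact constructors depend on the Lucene version bundled with your Katta build, and the class and field names below are only illustrative, not part of Katta.

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class SimpleLuceneIndex {
  public static void main(String[] args) throws Exception {
    // Open (or create) an index directory on the local file system.
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("my-index")),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

    // Index one document with a single analyzed "content" field.
    Document doc = new Document();
    doc.add(new Field("content", "Alice was beginning to get very tired",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);

    // Close the writer so the index is flushed and readable by searchers.
    writer.close();
  }
}

A Katta index is then nothing more than a folder containing several such Lucene index folders.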

When you have a lot of data, a single Lucene index can become unwieldy in terms of size and indexing time. In this scenario creating a Katta index is a good idea, since you can distribute the indexing work over many machines using Hadoop.

Katta comes with sample code that we will use in this tutorial. Since Katta is released under the Apache License 2.0, feel free to copy, paste, and modify the code and build scripts as a starting point for your own application.

Our steps will be:

  • Building Katta from sources
  • Setting up a Hadoop cluster
  • Getting data on the Hadoop DFS
  • Running our indexing job
  • Deploying the index in Katta

Build Katta from sources

First create a working folder and check out the code from Git.

mkdir ~/katta-tutorial
cd ~/katta-tutorial
git clone https://github.com/sgroschupf/katta.git
cd katta/

Now let's compile the code. Make sure you have Java 1.6 installed; the code will not compile with Java 1.5.

ant compile

Congratulations, that was easy. Next, let's set up Hadoop.

Setting up a Hadoop cluster

Setting up Hadoop is the most time-consuming and complicated part. For this tutorial we use a localhost cluster (i.e. you run Hadoop on your local machine). This is enough to illustrate what you need to do, though for production you will want to set up a real distributed cluster.

Let's get back to our katta-tutorial folder

cd ..

and download Hadoop
wget http://apache.mirror.facebook.com/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
tar -xzvf hadoop-0.20.1.tar.gz

Let’s set up some required environment variables
export HADOOP_HOME=~/katta-tutorial/hadoop-0.20.1
export PATH=$HADOOP_HOME/bin:$PATH
export JAVA_HOME=<path to java home>

Now we edit the following files in HADOOP_HOME/conf:

core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

hadoop-env.sh
export JAVA_HOME=<absolute path to java>
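
If you want to verify programmatically that clients pick up the configured default file system, a small sanity check like the following works. Run it through bin/hadoop so that HADOOP_HOME/conf ends up on the classpath; the class name is made up for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CheckDfsConfig {
  public static void main(String[] args) throws Exception {
    // Configuration automatically loads core-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Should print hdfs://localhost:9000 with the configuration above.
    System.out.println("Default file system: " + fs.getUri());
  }
}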

As the next step we need to configure ssh for our local machine. Under Mac OS X make sure you switch on Remote Login in System Preferences > Sharing.

Try to ssh into ‘localhost’
ssh localhost

If this requires a password, execute the following commands

If ~/.ssh does not exist:
> mkdir ~/.ssh
> chmod 700 ~/.ssh
> touch ~/.ssh/authorized_keys
> chmod 600 ~/.ssh/authorized_keys

Create a new key:

> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Try the “ssh localhost” command again.

Make sure your local machine has a hostname that can also be resolved back to an address (reverse lookup). Try:
export MY_HOST_NAME=`hostname`
ssh $MY_HOST_NAME

If this does not work you need to fix your network configuration. (Going offline usually helps :) )
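
If you prefer to check this from Java, a quick lookup of the local hostname and its reverse resolution looks like the following. This is purely illustrative; the class name is invented for this tutorial.

import java.net.InetAddress;

public class HostnameCheck {
  public static void main(String[] args) throws Exception {
    // Hadoop daemons register themselves under the local hostname,
    // so it must resolve to an address and back again.
    InetAddress local = InetAddress.getLocalHost();
    System.out.println("Hostname: " + local.getHostName());
    System.out.println("Address:  " + local.getHostAddress());
    System.out.println("Reverse:  "
        + InetAddress.getByName(local.getHostAddress()).getCanonicalHostName());
  }
}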

Now we can start the Hadoop cluster:
> hadoop namenode -format
> start-all.sh

This should start Hadoop with one NameNode, one JobTracker, one DataNode and one TaskTracker.

Open the following URLs in your browser and make sure you see one DataNode and one TaskTracker.

http://localhost:50030/ (JobTracker)
http://localhost:50070/ (NameNode)

Getting data on the Hadoop DFS

Your data can come from many different sources, but if you want to leverage the power of Hadoop for indexing, it is a good idea to first get the data onto the Hadoop DFS, either as raw text (e.g. log files) or as a SequenceFile.
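
To give you an idea of what writing such a SequenceFile looks like with the plain Hadoop API, here is a minimal sketch. This is not Katta's SequenceFileCreator; the class name, key/value types, and the content are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteSampleSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("hdfs://localhost:9000/sample");
    FileSystem fs = path.getFileSystem(conf);

    // Write key/value records; the indexing job later reads the values
    // as the text to be indexed.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
        LongWritable.class, Text.class);
    try {
      writer.append(new LongWritable(1), new Text("Alice was beginning to get very tired"));
      writer.append(new LongWritable(2), new Text("of sitting by her sister on the bank"));
    } finally {
      writer.close();
    }
  }
}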

Katta comes with some sample code that can be found in the modules/katta-indexing-sample project. If you want to store your data as a SequenceFile you might find the SequenceFileCreator interesting. Let's create some sample data and work with it:

cd ~/katta-tutorial/katta/modules/katta-indexing-sample
ant jar
sh run.sh net.sf.katta.indexing.SequenceFileCreator hdfs://localhost:9000/sample ../../sample-data/texts/alice.txt 1000

Now we have to create a job jar. A job jar is a jar with our Lucene jar dependencies bundled in an embedded lib folder.

Run our indexing job

Besides the SequenceFileCreator, Katta also provides sample code for a Hadoop indexing job. The job is basically just a Hadoop MapRunnable implementation that reads the SequenceFile records and adds them to a Lucene index on the local hard drive. When all records are indexed, the local Lucene index is copied into the Hadoop DFS.
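
To make the structure of the job more concrete, here is a rough sketch of what such a MapRunnable could look like. This is not Katta's actual implementation; the class name, the "content" field, and the shard naming are invented for illustration, and the Lucene calls again assume a 2.9-era API.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class IndexingMapRunnable implements MapRunnable<LongWritable, Text, Text, Text> {

  private JobConf conf;

  public void configure(JobConf conf) {
    this.conf = conf;
  }

  public void run(RecordReader<LongWritable, Text> reader,
      OutputCollector<Text, Text> collector, Reporter reporter) throws IOException {
    // Build a Lucene index on the local disk of the task tracker machine.
    File localIndex = new File("katta-shard-" + System.currentTimeMillis());
    IndexWriter writer = new IndexWriter(FSDirectory.open(localIndex),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

    // Read every record from the SequenceFile split and index its value.
    LongWritable key = reader.createKey();
    Text value = reader.createValue();
    while (reader.next(key, value)) {
      Document doc = new Document();
      doc.add(new Field("content", value.toString(), Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
      reporter.progress();
    }
    writer.optimize();
    writer.close();

    // Copy the finished shard from the local disk into the job's DFS output folder.
    Path outputFolder = FileOutputFormat.getOutputPath(conf);
    FileSystem fs = outputFolder.getFileSystem(conf);
    fs.copyFromLocalFile(new Path(localIndex.getAbsolutePath()),
        new Path(outputFolder, localIndex.getName()));
  }
}

Note that the output collector is not really used; the Lucene shard folders written to the DFS are the actual output of the job.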

Let's start by creating a Hadoop job jar.
ant job-jar

Change directory to our version of Hadoop:
cd ~/katta-tutorial/hadoop-0.20.1

Let's make sure we have sample data:
bin/hadoop fs -ls /
You should see:
-rw-r--r-- 3 joa23 supergroup 60598 2009-03-16 01:54 /sample

bin/hadoop jar ~/katta-tutorial/katta/modules/katta-indexing-sample/build/katta-indexing-sample-*.jar hdfs://localhost:9000/sample hdfs://localhost:9000/index 2

Now we need to clean up the index folder, since Hadoop stored some files and folders there that we do not want.
bin/hadoop fs -rmr /index/_logs
bin/hadoop fs -rmr /index/part-*

Congratulations, you have generated a Katta index. Now let's try to deploy it.

Deploy the index in Katta

Our goal is to deploy the freshly generated index. First we need to start a localhost Katta grid.

For this we need three shell windows open; change the directory in each of them to ~/katta-tutorial/katta/.

In window 1 we start the master:
bin/katta startMaster

In window 2 we start a node:
bin/katta startNode

In window 3 we run the deploy command:
bin/katta addIndex testIndex hdfs://localhost:9000/index org.apache.lucene.analysis.standard.StandardAnalyzer 2

This might take some time, since Katta downloads the index from the Hadoop DFS to the local hard drive.

Now you can search with

bin/katta search testIndex "content:alice"

Congratulations! You have just created an index with Hadoop and deployed it into Katta.