This tutorial will show you how to create a simple Katta index with Hadoop. Since a Katta index is basically just a folder with Lucene index subfolders, creating a Katta index could be as easy as copying a set of Lucene indexes into a folder.
Creating a Lucene index is a relatively straightforward process. If you have never used Lucene before, please read an excellent introduction to Lucene. The Lucene in Action book by Erik Hatcher and Otis Gospodnetić is also a great place to start.
When you have a lot of data, a single Lucene index can become unwieldy in terms of size and performance. In this scenario, creating a Katta index with Lucene is a good idea, since you can distribute the indexing over many machines using Hadoop.
Katta comes with sample code that we will use in this tutorial. Since Katta is released under the Apache License 2.0, feel free to copy, paste and modify the code and build scripts and use them as a starting point for your application.
Our steps will be:
- Building Katta from sources
- Setting up a Hadoop cluster
- Getting data on the Hadoop DFS
- Run our indexing job
- Deploy the index in Katta
Build Katta from sources
First create a working folder and check out the code from git.
git clone https://github.com/sgroschupf/katta.git
Now let's compile the code. Make sure you have Java 1.6 installed. The code will not compile with Java 1.5.
Congratulations, that was easy. Next, let's set up Hadoop.
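Before compiling, it can help to confirm which Java version is on the PATH. The following is only a minimal sketch and not part of Katta: the helper name java_minor_version is made up for illustration, and it assumes the legacy "1.x" version banner that Java 1.6 prints. Newer JVMs print a different banner, which this sketch simply treats as recent enough.

```shell
#!/bin/sh
# Hypothetical helper (not part of Katta): extract the minor version from a
# legacy "1.x" Java version banner, e.g. 'java version "1.6.0_45"' gives 6.
# Banners that do not start with "1." produce empty output.
java_minor_version() {
    printf '%s\n' "$1" | sed -n 's/.*version "1\.\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Only query the JVM if a java binary is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    banner=$(java -version 2>&1 | head -n 1)
    minor=$(java_minor_version "$banner")
    if [ -n "$minor" ] && [ "$minor" -lt 6 ]; then
        echo "Found Java 1.$minor, but the code needs at least Java 1.6" >&2
    else
        echo "Java check passed: $banner"
    fi
else
    echo "No java binary found on the PATH" >&2
fi
```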
Setting up a Hadoop cluster
Setting up Hadoop is the most time-consuming and complicated part. For this tutorial we use a localhost cluster (i.e. you run Hadoop on your local machine). This is enough to illustrate what you need to do, though for production you will want to set up a real distributed cluster.
Let's get back to our katta-tutorial folder
and download Hadoop
tar -xzvf hadoop-0.20.1.tar.gz
Let's set up some required environment variables:
export HADOOP_HOME=~/katta-tutorial/<hadoop dir>
export JAVA_HOME=<path to java home>
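Many later problems come from one of these two variables being unset or pointing to the wrong place, so a quick sanity check can save time. This is a hedged sketch: check_home is a hypothetical helper name, not something shipped with Hadoop or Katta, and it only verifies that each variable names an existing directory.

```shell
#!/bin/sh
# Hypothetical helper: report whether a variable holds an existing directory.
# Usage: check_home NAME VALUE
check_home() {
    name=$1
    value=$2
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value"
        return 1
    fi
    echo "$name looks fine: $value"
}

# Report on both variables without aborting if one of them is wrong.
check_home HADOOP_HOME "${HADOOP_HOME:-}" || true
check_home JAVA_HOME "${JAVA_HOME:-}" || true
```

Run it in the same shell in which you exported the variables, since exports do not carry over to other terminal windows.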
Now we edit the following files in HADOOP_HOME/conf:
JAVA_HOME=<absolute path to java>
As the next step we need to configure ssh for our local machine. Under Mac OS X, make sure you switch on Remote Login in System Preferences > Sharing.
Try to ssh into ‘localhost’
If this requires a password, execute the following commands.
If ~/.ssh does not exist:
> mkdir ~/.ssh
> chmod 700 ~/.ssh
> touch ~/.ssh/authorized_keys
> chmod 600 ~/.ssh/authorized_keys
Create a new key:
> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Try the “ssh localhost” command again.
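If the login still asks for a password, wrong permissions on the ssh folder are a common cause. The sketch below checks the two permission settings made by the commands above and then tries a non-interactive login. The helper name has_mode is made up for this example; BatchMode and ConnectTimeout are standard OpenSSH client options, and we assume an OpenSSH client is installed.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a path shows exactly the expected
# permission string as printed by ls (e.g. drwx------ for mode 700).
has_mode() {
    actual=$(ls -ld "$1" 2>/dev/null | cut -c1-10)
    [ "$actual" = "$2" ]
}

ssh_dir="$HOME/.ssh"
has_mode "$ssh_dir" "drwx------" || echo "Check permissions on $ssh_dir (expected 700)"
has_mode "$ssh_dir/authorized_keys" "-rw-------" || echo "Check permissions on $ssh_dir/authorized_keys (expected 600)"

# BatchMode makes ssh fail instead of prompting for a password, and -n keeps
# ssh from reading this script's standard input.
if command -v ssh >/dev/null 2>&1; then
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 localhost true >/dev/null 2>&1; then
        echo "Passwordless ssh to localhost works"
    else
        echo "Passwordless ssh to localhost does not work yet"
    fi
fi
```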
Make sure your local machine has a hostname that it can also resolve with a reverse lookup. Try:
If this does not work, you need to fix your network configuration. (Going offline usually helps.)
Now we can start the Hadoop cluster:
> hadoop namenode -format
This should start Hadoop with one NameNode, one JobTracker, one DataNode and one TaskTracker.
Open the following URLs in your browser and make sure you see one DataNode and one TaskTracker.
Getting data on the Hadoop DFS
Your data will probably come from different sources, but if you want to leverage the power of Hadoop while indexing, it is a good idea to first get the data onto the Hadoop DFS as raw text (e.g. log files) or as a sequence file.
Katta comes with some sample code that can be found in the extras/indexing project. If you want to store your data as a SequenceFile, you might find the SequenceFileCreator interesting. Let's create some sample data and work with it:
sh run.sh net.sf.katta.indexing.SequenceFileCreator hdfs://localhost:9000/sample ../../sample-data/texts/alice.txt 1000
Now we have to create a job jar. A job jar is a jar with our Lucene jar dependencies in an embedded lib folder.
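To make the layout concrete, here is a small sketch that stages such a structure in a temporary folder. All file names in it (com/example/IndexJob.class, lucene-core.jar) are placeholders invented for this illustration, and the files are empty. The real job jar is assumed to come from the project's build, so treat this purely as a picture of where things go.

```shell
#!/bin/sh
# Stage a hypothetical job jar layout in a temporary directory.
stage=$(mktemp -d)
mkdir -p "$stage/lib" "$stage/com/example"

# Placeholder files: job classes live at the jar root, while dependency
# jars go into the embedded lib folder.
: > "$stage/com/example/IndexJob.class"
: > "$stage/lib/lucene-core.jar"

# Show the resulting layout relative to the jar root.
listing=$(cd "$stage" && find . -type f | sort)
echo "$listing"

# If the JDK's jar tool is available, pack the staged folder into a jar.
if command -v jar >/dev/null 2>&1; then
    jar cf "$stage.jar" -C "$stage" . && echo "Wrote $stage.jar"
fi
```

When Hadoop runs a job jar, the jars found in its lib folder are added to the classpath of the job, which is why the Lucene dependencies are packaged this way.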
Run our indexing job
Besides the SequenceFileCreator, Katta also provides sample code for a Hadoop indexing job. The job is basically just a Hadoop MapRunnable implementation that reads the sequence file records and adds them to a Lucene index on the local hard drive. When all records are indexed, the local Lucene index is copied into the Hadoop DFS.
Let's start by creating a Hadoop job jar.
Change directory to our version of Hadoop:
Let's make sure we have sample data:
bin/hadoop fs -ls /
You should see:
-rw-r--r-- 3 joa23 supergroup 60598 2009-03-16 01:54 /sample
bin/hadoop jar ~/katta-tutorial/katta/modules/katta-indexing-sample/build/katta-indexing-sample-*.jar hdfs://localhost:9000/sample hdfs://localhost:9000/index 2
Now we need to clean up the index folder, since Hadoop stored some folders there that we do not want.
bin/hadoop fs -rmr /index/_logs
bin/hadoop fs -rmr '/index/part-*'
Congratulations, you have generated a Katta index. Now let's try to deploy it.
Deploy the index in Katta
Our goal is to deploy the freshly generated index. First we need to start a localhost Katta grid, as described here.
For this we need three shell windows open, each with its working directory changed to ~/katta-tutorial/katta/.
In window 1 we start the master:
In window 2 we start a node:
In window 3 we run the deploy command:
bin/katta addIndex testIndex hdfs://localhost:9000/index org.apache.lucene.analysis.standard.StandardAnalyzer 2
This might take some time, since Katta is downloading the index from Hadoop DFS to the local hard drive.
Now you can search with:
bin/katta search testIndex "content:alice"
Congratulations! You have just created an index with Hadoop and deployed it into Katta.