Beginning with version 0.5 (trunk), Katta can be run on Amazon EC2. EC2 is a service that lets you rent servers on demand, billed hourly per host. This allows you to run a Katta grid without owning and operating your own hardware, which can be a significant cost advantage. Even more importantly, it allows you to scale a Katta grid up or down based on your demand.
To get started you need a working EC2 account and should at least run through the EC2 Getting Started Guide.
All commands starting with % are executed on your local machine, and all commands starting with # are executed on an EC2 instance.
Run a Katta Cluster
- Get the latest Katta release or sources (version 0.5 or higher)
- Create a katta-ec2-env.sh file in KATTA_HOME/extras/ec2/bin by copying katta-ec2-env.sh.template
- Edit the following variables in your new katta-ec2-env.sh. You should have appropriate values after completing the EC2 Getting Started Guide:
- KATTA_VERSION and S3_BUCKET - if you want to control which Katta image is used
- JAVA_BINARY_URL - a fresh link to the Java binary; only needed if you plan to build a custom Katta image
- Navigate to the EC2 script folder at KATTA_HOME/extras/ec2/
- Start a Katta cluster via:
% bin/katta-ec2 launch-cluster test-cluster 2
test-cluster is the name of the cluster and 2 is the number of nodes in the cluster. Be patient, as starting the nodes might take a little time even after the Linux OS on your nodes has booted.
- Log in to the master of the cluster via:
% bin/katta-ec2 login test-cluster
- On the master you can use the Katta command line to manage Katta. Note that the KATTA_HOME/bin folder is in your shell path already.
# katta listNodes
# katta listIndices
No index should be registered by default.
- To shut down your cluster:
% bin/katta-ec2 terminate-cluster test-cluster
- To clean up, you can delete the created security group with:
% bin/delete-cluster test-cluster
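As a sketch, the katta-ec2-env.sh file edited in the steps above might contain entries like the following. The bucket name and JDK URL are placeholders, not real values; replace them with your own settings from the EC2 Getting Started Guide:

```shell
# Hypothetical katta-ec2-env.sh values; all names below are placeholders.
KATTA_VERSION=0.5          # controls which Katta image is used
S3_BUCKET=my-katta-bucket  # placeholder: bucket holding the Katta AMI
# Only needed when building a custom image; JDK download links expire
# after a couple of minutes, so fetch a fresh one right before building.
JAVA_BINARY_URL=http://example.com/jdk-6u10-linux-i586.bin
```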
Monitor the cluster
The Katta AMI comes with UC Berkeley’s Ganglia pre-installed to allow you to monitor server load. Since EC2 nodes are secured by a firewall by default, you need to set up a SOCKS proxy to access Ganglia.
- Create a SOCKS proxy tunnel to the master:
% bin/katta-ec2 proxy test-cluster
- Now configure a SOCKS proxy on your machine:
- On Mac OS X, launch System Preferences and go to Network > Advanced > Proxies. Check SOCKS Proxy and use localhost, port 6666, as the proxy server
- Alternatively you can use FoxyProxy for Firefox
- Point your browser at http://INSTANCE_NAME.amazonaws.com/ganglia as provided in the output of the proxy command.
- Please note that it might take a few minutes until Ganglia shows all graphs, since gathering all metrics takes time. Just refresh your browser page after a few minutes.
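Before configuring a browser, you can sanity-check the tunnel from the shell. This is a hypothetical helper, not part of the Katta scripts; it assumes the tunnel from the proxy command listens on localhost, and the host name you pass in is the one printed by that command:

```shell
#!/bin/sh
# Hypothetical helper to probe Ganglia through the SOCKS tunnel.
# $1: master host name (placeholder), $2: local SOCKS port (e.g. 6666)
check_ganglia() {
  if curl --silent --head --fail --max-time 10 \
      --socks5-hostname "localhost:$2" "http://$1/ganglia/" > /dev/null
  then
    echo "ganglia reachable"
  else
    echo "tunnel not up yet"
  fi
}
```

For example, `check_ganglia ec2-xx-xx-xx-xx.compute-1.amazonaws.com 6666` (with your own master host name) should report the page as reachable once the proxy is running.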
Mount a Hard Drive on all Nodes
Katta requires a shared volume on all nodes to be able to load index shards. There are three different options.
- Run a Hadoop cluster and use HDFS to store your very large index. Please note that you must manually set up security between the Katta cluster and the Hadoop cluster so that Katta can access the HDFS ports on the Hadoop cluster.
- Use S3 and the Hadoop interface for S3. Please see the Hadoop documentation for this topic.
- Share a hard drive on all nodes using NFS. This might be the easiest solution for testing and small to mid-size clusters. You can run the NFS server on the master, but for production we suggest manually starting a separate EC2 instance that serves only as your file server. Currently, large EC2 servers have 1690 GB hard drives. However, data is lost the moment you shut down the NFS server instance. An interesting alternative might be the Amazon Elastic Block Store.
To use NFS, first find an image that suits your needs, and make sure NFS is installed:
- find images with:
% ec2-describe-images -a
- start an instance with:
% ec2-run-instances ami-2b5fba42 -g test-cluster -k katta-keypair -t m1.small
Where:
- ami-2b5fba42 is the image name
- test-cluster is the name of your Katta cluster and also the name of your security group
- katta-keypair is the name of the keypair you created while walking through the Getting Started Guide
- m1.small is the EC2 instance (server) size; please note that you might need a different image if you use larger instance sizes
- Make sure the instance has started with:
% ec2-describe-instances
Wait until you see 'running'.
- Now start the NFS server and mount it on all worker nodes with:
% bin/katta-ec2 mount INSTANCE_ID /data/mount test-cluster
- Note: since we use restart rather than start for the NFS services, you might see a couple of [FAILED] messages; just ignore them for now.
- Now you should have a shared /data/mount/ on all nodes.
- Please note that you lose all data in this folder when you shut down the NFS server or terminate the cluster.
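The "wait until you see 'running'" step above can be scripted. The following is a hypothetical sketch, not part of the Katta scripts: it repeatedly runs whatever state-query command you pass it (for example a small extraction over ec2-describe-instances output, whose column layout depends on your API tools version) until the reported state is "running":

```shell
#!/bin/sh
# Hypothetical polling helper: "$@" is a command that prints the
# current instance state (e.g. pending, running).
wait_until_running() {
  while true; do
    state=$("$@")
    if [ "$state" = "running" ]; then
      echo "instance is running"
      return 0
    fi
    sleep 10   # EC2 instances can take a while to boot
  done
}
```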
Building an Image
The Katta EC2 scripts also let you create your own image. To do this, customize the configuration in the katta-ec2-env.sh script. Select a base image and make sure you have a fresh JDK link (choose 32 or 64 bit depending on your instance size), since the link's lifetime is just a couple of minutes. The instance size (among other things) can be customized as well. If you want to install some extra tools, line 38 of the create-katta-image-remote.sh script is a good place to add them. The location and name of the image can be customized with the variables S3_BUCKET and KATTA_VERSION.
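Because those JDK download links expire within minutes, it can save a failed build to verify the link just before kicking off image creation. This is a hypothetical helper, not part of the Katta scripts; note that some mirrors reject HEAD requests, in which case a short ranged GET would be needed instead:

```shell
#!/bin/sh
# Hypothetical check that a JDK download link is still alive.
# $1: the URL you configured as JAVA_BINARY_URL
check_jdk_link() {
  if curl --silent --head --fail --max-time 10 "$1" > /dev/null; then
    echo "link ok"
  else
    echo "link expired - fetch a fresh one"
  fi
}
```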
Then build the image with:
% bin/katta-ec2 create-image
This might take a couple of minutes.
Image creation starts an instance, so make sure you terminate that instance when you are done.
If you want to, you can publish your new image with:
% ec2-modify-image-attribute AMI-ID --launch-permission -a all