Katta can merge indices (NOTE: this functionality is available in branch-0.3 only).

Multiple standalone indices can be merged into one new index.

Features:

  • removal of duplicates (based on a sortable index field)

Limitations:

  • The old indexes, and with them the old index names, are lost.
  • This means that searching indexB by name fails after indexB has been merged with indexA. Searching against all indexes (“*”), on the other hand, should produce the same results before and after merging (apart from the total hit count).
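
To illustrate this limitation, here is a toy model in Python. It is not Katta code: the dictionary layout, the `search` helper and the example merged index name are made up for this sketch. It only shows that a lookup by an old name fails after the merge, while a search over “*” still finds the same documents.

```python
# Toy model only: deployed indexes as a dict of index name -> list of documents.
def search(deployed, index_name, term):
    """Return matching docs from one index, or from all indexes for "*"."""
    if index_name == "*":
        names = sorted(deployed)
    elif index_name in deployed:
        names = [index_name]
    else:
        raise KeyError("unknown index: " + index_name)
    return [doc for name in names for doc in deployed[name] if term in doc]


before = {"indexA": ["apple pie"], "indexB": ["apple juice", "orange juice"]}
# After merging, only the new merged index is deployed; the old names are gone.
after = {"mergedIndex-090307.140509": before["indexA"] + before["indexB"]}

hits_before = sorted(search(before, "*", "apple"))
hits_after = sorted(search(after, "*", "apple"))

try:
    search(after, "indexB", "apple")
    lookup_failed = False
except KeyError:
    lookup_failed = True
```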

Process:

  • two map-reduce jobs are executed to perform the merge
  • the new index is deployed to Katta (even if you had no duplicates before, you have them now, since the old indexes are still deployed at this point; the search client has a duplicate filter, so it shouldn’t be too bad)
  • the old indexes are un-deployed from Katta
  • the old indexes are moved into the configured archive folder

HOW-TO

Configuration:

  • Configure the archive path (that’s where the old indexes are copied to after the merge has taken place)
  • Configure the key and sort fields for your indexes in katta.index.properties (the key field is used to detect duplicates, and the sort field to choose which of the duplicates to keep)

index.archive.path=/katta/archive
...
document.duplicate.information.class=net.sf.katta.index.indexer.merge.ConfigurableDocumentDuplicateInformation
net.sf.katta.index.indexer.merge.ConfigurableDocumentDuplicateInformation.keyField=id
net.sf.katta.index.indexer.merge.ConfigurableDocumentDuplicateInformation.sortField=timestamp
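
As a rough illustration of what the key and sort fields are for, the following Python sketch drops documents that share the same `id` and keeps one document per key. It is not Katta's implementation. It assumes that the document with the highest `timestamp` wins; the text above only says that the sort field decides which duplicate is kept, so check the actual behaviour of `ConfigurableDocumentDuplicateInformation` before relying on this.

```python
def remove_duplicates(docs, key_field="id", sort_field="timestamp"):
    """Keep one document per key value (assumption: the highest sort value wins)."""
    kept = {}
    for doc in docs:
        key = doc[key_field]
        # Replace the stored document only if this one sorts higher.
        if key not in kept or doc[sort_field] > kept[key][sort_field]:
            kept[key] = doc
    return list(kept.values())


docs = [
    {"id": "a", "timestamp": 1, "text": "old version"},
    {"id": "b", "timestamp": 5, "text": "only version"},
    {"id": "a", "timestamp": 9, "text": "new version"},
]
merged = remove_duplicates(docs)
```

With the sample data, `merged` contains two documents: the newer of the two `id=a` entries and the single `id=b` entry.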

Execution:

  • go to the Katta distribution directory
  • either merge specific indices:

bin/katta mergeIndexes -indexes index1,index2,index3 -hadoopSiteXml ~/hadoop/conf/hadoop-site.xml

  • or merge all deployed indexes:


bin/katta mergeIndexes -hadoopSiteXml ~/hadoop/conf/hadoop-site.xml
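
If the merge is to be triggered from a script, the two invocations above could be wrapped roughly as follows. This is only a sketch in Python: the helper names `merge_command` and `run_merge` are hypothetical, and it uses nothing but the `-indexes` and `-hadoopSiteXml` options shown above.

```python
import os
import subprocess


def merge_command(hadoop_site_xml, indexes=None):
    """Build the argument list for bin/katta mergeIndexes."""
    cmd = ["bin/katta", "mergeIndexes"]
    if indexes:
        # Merge only the named indexes; without this option all deployed indexes are merged.
        cmd += ["-indexes", ",".join(indexes)]
    # Expand a leading "~" ourselves, since no shell is involved.
    cmd += ["-hadoopSiteXml", os.path.expanduser(hadoop_site_xml)]
    return cmd


def run_merge(katta_home, hadoop_site_xml, indexes=None):
    """Run the merge from the Katta distribution directory; raises on a non-zero exit code."""
    subprocess.run(merge_command(hadoop_site_xml, indexes), cwd=katta_home, check=True)
```

For example, `run_merge("/opt/katta", "~/hadoop/conf/hadoop-site.xml", ["index1", "index2"])` would merge two specific indexes, where `/opt/katta` is a placeholder for wherever the distribution is unpacked.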

The new Katta index is named in the format “mergedIndex-yymmdd.hhmmss” and is stored under the configured upload path.
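
For scripts that need to recognise or predict the merged index name, the pattern can be reproduced as shown here. The sketch assumes that “yymmdd.hhmmss” stands for two-digit year, month and day, followed by hour (24-hour clock), minute and second; which clock or time zone Katta uses is not stated here. The function name `merged_index_name` is made up for this example.

```python
from datetime import datetime


def merged_index_name(when):
    """Format a timestamp as mergedIndex-yymmdd.hhmmss (assumed 24-hour clock)."""
    return when.strftime("mergedIndex-%y%m%d.%H%M%S")


example = merged_index_name(datetime(2009, 3, 7, 14, 5, 9))
```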