Real-Time Analytics: Implementing a Lambda Architecture on Hadoop - Part 2
HBase-Lily Indexer: Indexing Data from HBase to Solr by Configuration
This is the second part of my 3-part blog series on achieving real-time analytics capability. The focus of this post is indexing data from HBase into Solr purely by configuration, with very little development. If you have a web or mobile app, it is nice to have search capability on your data; to achieve fuzzy search we use Solr. Since we already loaded data into HBase as part of the Spark ETL in part 1, there is no need for a separate ETL process to load Solr.
The Lily Indexer keeps data that is added, updated, or deleted in an HBase table in sync with a Solr collection in near real time. Indexing allows you to query data stored in HBase through the Solr service. The indexer supports flexible, custom, application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase, so applications can use the search result set to directly access the matching raw HBase cells.
The goal of this post is to index data by configuration rather than code. The tasks we are trying to achieve can be split into:
- Create HBase table
- Create Solr collection
- Create morphline.conf
- Create mapper.xml
- Create indexer
- Insert data into HBase and test Solr
Create HBase table & test
Data was loaded into HBase using Spark in part 1 of this series; however, here we will create a small sample table. It is important that the table's column family is created with REPLICATION_SCOPE set to 1, since the indexer relies on HBase replication to pick up changes.
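The sample table can be created in the HBase shell with replication enabled on the column family (the table and family names match the puts below):

```shell
create 'employee', {NAME => 'info', REPLICATION_SCOPE => 1}
```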
put 'employee', '10001', 'info:eid', '10001'
put 'employee', '10001', 'info:ename', 'santosh'
put 'employee', '10002', 'info:eid', '10002'
put 'employee', '10002', 'info:ename', 'nivas'
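To confirm the rows landed, scan the table in the HBase shell:

```shell
scan 'employee'
```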
Create Solr collection
In this step we create a basic Solr collection with a couple of fields.
Once the XML files are edited, upload the entire conf directory to ZooKeeper and create a new Solr collection. When creating the collection, pick the number of shards and replicas based on your Hadoop cluster configuration.
solrctl instancedir --update EMP /yourpath_where_conf_got_generated/EMP
solrctl collection --create EMP -s 2 -r 2
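The "couple of fields" mentioned above must exist in the collection's schema before the conf directory is uploaded. A minimal sketch of the field definitions in schema.xml, assuming the stock string and text_general field types that ship with Solr:

```xml
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="ename" type="text_general" indexed="true" stored="true"/>
```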
Create morphline.conf
Data from HBase can be moved to Solr in two ways:
1. In real time - whenever HBase gets updated, the change is indexed into Solr
2. In batch mode - if you want to move data to Solr in bulk, e.g. as a one-time initial load
In this step we are focusing on indexing data in real time.
Morphlines is an open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to build transformation pipelines without programming and without substantial MapReduce skills, morphlines can get the job done with a minimum of fuss and support cost.
A sample morphline file for our purpose looks like this. This configuration file holds the mapping from the HBase table to the Solr collection.
morphlines : [
  {
    id : emp_morphline
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "info:eid"
              outputField : "id"
              type : string
              source : value
            }
            {
              inputColumn : "info:ename"
              outputField : "ename"
              type : string
              source : value
            }
          ]
        }
      }
      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
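To make the mapping concrete, here is a small Python sketch of what the extractHBaseCells step does. This is purely illustrative (the real work is done by the Lily indexer, not this code): for each configured mapping it reads a cell value from the HBase row and writes it into the named output field of the Solr document.

```python
# Illustrative only: mimics the extractHBaseCells mappings from the
# morphline above. Real indexing is performed by the Lily indexer.

MAPPINGS = [
    {"inputColumn": "info:eid",   "outputField": "id"},
    {"inputColumn": "info:ename", "outputField": "ename"},
]

def extract_hbase_cells(row):
    """Map HBase cells ({'family:qualifier': value}) to a Solr document dict."""
    doc = {}
    for m in MAPPINGS:
        # Only cells named in a mapping are carried over to Solr.
        if m["inputColumn"] in row:
            doc[m["outputField"]] = row[m["inputColumn"]]
    return doc

# One of the sample rows inserted earlier:
row_10001 = {"info:eid": "10001", "info:ename": "santosh"}
print(extract_hbase_cells(row_10001))  # {'id': '10001', 'ename': 'santosh'}
```

Notice that info:eid is mapped to the field named id; the collection schema requires a field with that exact name, which is why the note at the end of this post insists on it.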
Create mapper.xml
We are almost there.
Once the morphline configuration file is created, we need to create a mapper XML file, which will be used when registering the indexer.
<?xml version="1.0"?>
<indexer table="employee"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="<your path to morphline conf>/emp_morphlines.conf"/>
  <param name="morphlineId" value="emp_morphline"/>
</indexer>
Indexer Implementation
With the above steps the configuration is done; now we will look at the commands to create, list, delete, and debug the indexer.
Create indexer
export HBASE_INDEXER_OPTS=-Djava.security.auth.login.config=/etc/hbase/conf/jaas.conf
hbase-indexer add-indexer \
  --name emp_index \
  --indexer-conf <path>/emp_morphline_hbase_mapper.xml \
  --connection-param solr.zk="zookeeper quorum" \
  --connection-param solr.collection=EMP \
  --zookeeper "zookeeper quorum"
List all indexers with full details, including the mapper XML:
hbase-indexer list-indexers --http http://<domain>:11060/indexer/ \
  --jaas jaas.conf --zookeeper "zookeeper quorum" --dump
Delete the indexer if required or on any errors:
hbase-indexer delete-indexer --name emp_index --zookeeper "zookeeper quorum"
Logs are generated in /var/log/hbase-solr/ and can be followed with:
tail -f lily*.out
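With the indexer running, a row inserted into HBase should appear in Solr within seconds. A quick way to test is to query the collection directly (the hostname and port below are placeholders for your Solr server):

```shell
curl "http://<solr-host>:8983/solr/EMP/select?q=ename:santosh&wt=json"
```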
With that step we have successfully implemented an HBase-Lily indexer!!
How to do a one-time load from HBase to Solr using the batch indexer
The batch indexer is another utility, used to do a one-time bulk load from HBase to Solr; thereafter a Lily indexer can be set up for updates in real time. The batch indexer indexes the existing data in HBase into Solr by running a MapReduce-based jar.
HADOOP_OPTS="-Djava.security.auth.login.config=/opt/solr/jaas.conf" \
hadoop --config /etc/hadoop/conf \
  jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file emp_morphline_hbase_mapper.xml \
  --zk-host "zookeeper quorum"/solr \
  --collection EMP \
  --go-live
Note: A few important things to remember:
1. The collection should have a field 'id'
2. Setting the replication scope on the column family when creating the HBase table is mandatory



50% of time, all the time I am wrong, Please correct me!!