Real-Time Analytics: Implementing a Lambda Architecture on Hadoop - Part 2
HBase-Lily Indexer: Indexing Data from HBase to Solr by Configuration
This is the second part of my 3-part blog series on achieving real-time analytics capability. The focus of this post is indexing data from HBase into Solr purely by configuration, with very little development. If you have a web or mobile app, it is nice to have search capability on your data; to achieve fuzzy search we use Solr. Since we already loaded data into HBase as part of the Spark ETL in part 1, there is no need for a separate ETL process to load Solr.
The Lily Indexer keeps data that is added, updated, or deleted in an HBase table in sync with a Solr collection in near real time. Indexing allows you to query data stored in HBase through the Solr service. The indexer supports flexible, custom, application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase, so applications can use the search result set to directly access the matching raw HBase cells.
The goal of this post is to index data by configuration rather than code. The tasks we are trying to achieve can be split into:
- Create HBase table
- Create Solr collection
- Create morphline.conf
- Create mapper.xml
- Create indexer
- Insert data into HBase and test Solr
Create HBase table & test
Data was loaded into HBase using Spark in part 1 of this series; however, here we will create a small sample table. It is important that the table's column family is created with REPLICATION_SCOPE set to 1, since the indexer relies on HBase replication to pick up changes.
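The sample table can be created in the HBase shell with replication enabled on the column family (the table and family names match the puts below):

```shell
create 'employee', {NAME => 'info', REPLICATION_SCOPE => 1}
```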
put 'employee', '10001', 'info:eid', '10001'
put 'employee', '10001', 'info:ename', 'santosh'
put 'employee', '10002', 'info:eid', '10002'
put 'employee', '10002', 'info:ename', 'nivas'
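To confirm the rows landed, scan the table in the HBase shell:

```shell
scan 'employee'
```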
Create Solr collection
In this step we create a basic Solr collection with a couple of fields.
Once the XML files are edited, upload the entire conf directory to ZooKeeper and create a new Solr collection. When creating the collection, pick the number of shards and replicas based on your Hadoop cluster configuration.
solrctl instancedir --update EMP /yourpath_where_conf_got_generated/EMP
solrctl collection --create EMP -s 2 -r 2
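The "couple of fields" mentioned above must exist in the collection's schema before the conf directory is uploaded. A minimal sketch of the field definitions in schema.xml, assuming the stock string and text_general field types that ship with Solr:

```xml
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="ename" type="text_general" indexed="true" stored="true"/>
```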
Create morphline.conf
Data from HBase can be moved to Solr in two ways:
1. In real time - whenever HBase gets updated, the change is indexed into Solr
2. In batch mode - if you want to move data to Solr in bulk, e.g. as a one-time initial load
In this step we are focusing on indexing data in real time.
Morphlines is an open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to build transformation pipelines without programming and without substantial MapReduce skills, morphlines can get the job done with a minimum of fuss and support cost.
A sample morphline file for our purpose looks like this. This configuration file holds the mapping from the HBase table to the Solr collection.
morphlines : [
  {
    id : emp_morphline
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "info:eid"
              outputField : "id"
              type : string
              source : value
            }
            {
              inputColumn : "info:ename"
              outputField : "ename"
              type : string
              source : value
            }
          ]
        }
      }
      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
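To make the mapping concrete, here is a small Python sketch of what the extractHBaseCells step does. This is purely illustrative (the real work is done by the Lily indexer, not this code): for each configured mapping it reads a cell value from the HBase row and writes it into the named output field of the Solr document.

```python
# Illustrative only: mimics the extractHBaseCells mappings from the
# morphline above. Real indexing is performed by the Lily indexer.

MAPPINGS = [
    {"inputColumn": "info:eid",   "outputField": "id"},
    {"inputColumn": "info:ename", "outputField": "ename"},
]

def extract_hbase_cells(row):
    """Map HBase cells ({'family:qualifier': value}) to a Solr document dict."""
    doc = {}
    for m in MAPPINGS:
        # Only cells named in a mapping are carried over to Solr.
        if m["inputColumn"] in row:
            doc[m["outputField"]] = row[m["inputColumn"]]
    return doc

# One of the sample rows inserted earlier:
row_10001 = {"info:eid": "10001", "info:ename": "santosh"}
print(extract_hbase_cells(row_10001))  # {'id': '10001', 'ename': 'santosh'}
```

Notice that info:eid is mapped to the field named id; the collection schema requires a field with that exact name, which is why the note at the end of this post insists on it.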
Create mapper.xml
We are almost there.
Once the morphline configuration file is created, we need to create a mapper XML file, which will be used when registering the indexer.
<?xml version="1.0"?>
<indexer table="employee"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="<your path to morphline conf>/emp_morphlines.conf"/>
  <param name="morphlineId" value="emp_morphline"/>
</indexer>
Indexer Implementation
With the above steps the configuration is done; now we will look at the commands to create, list, delete, and debug the indexer.
Create indexer
export HBASE_INDEXER_OPTS=-Djava.security.auth.login.config=/etc/hbase/conf/jaas.conf
hbase-indexer add-indexer \
  --name emp_index \
  --indexer-conf <path>/emp_morphline_hbase_mapper.xml \
  --connection-param solr.zk="zookeeper quorum" \
  --connection-param solr.collection=EMP \
  --zookeeper "zookeeper quorum"
List all indexers with full details, including the mapper XML:
hbase-indexer list-indexers --http http://<domain>:11060/indexer/ \
  --jaas jaas.conf --zookeeper "zookeeper quorum" --dump
Delete the indexer if required or on any errors:
hbase-indexer delete-indexer --name emp_index --zookeeper "zookeeper quorum"
Logs are generated in /var/log/hbase-solr/ and can be followed with:
tail -f lily*.out
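With the indexer running, a row inserted into HBase should appear in Solr within seconds. A quick way to test is to query the collection directly (the hostname and port below are placeholders for your Solr server):

```shell
curl "http://<solr-host>:8983/solr/EMP/select?q=ename:santosh&wt=json"
```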
With that step we have successfully implemented an HBase-Lily indexer!!
How to do a one-time load from HBase to Solr using the batch indexer
The batch indexer is another utility, used to do a one-time bulk load from HBase to Solr; thereafter a Lily indexer can be set up for updates in real time. The batch indexer indexes the existing data in HBase into Solr by running a MapReduce-based jar.
HADOOP_OPTS="-Djava.security.auth.login.config=/opt/solr/jaas.conf" \
hadoop --config /etc/hadoop/conf \
  jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file emp_morphline_hbase_mapper.xml \
  --zk-host "zookeeper quorum"/solr \
  --collection EMP \
  --go-live
Note: A few important things to remember:
1. The collection should have a field 'id'
2. Setting the replication scope on the column family when creating the HBase table is mandatory



50% of time, all the time I am wrong, Please correct me!!