Posts

Tale of stream and batch data ingestion

Data ingestion is an inevitable part of analytics, be it saving data entered by users on a website or moving a large volume of data from machines and sensors. This blog talks about the evolution of and trends in data ingestion, and about modern architecture principles for gaining insights from the ingested data. With that preface set, let's look at how data has been ingested since the inception of modern data warehousing and analytics:

- Good old file transfer: files from transaction systems, all kinds of files, were transferred using some file transfer protocol.
- Database entry: enter the data into databases directly.
- Queue mechanism: use a more reliable queue mechanism to transfer the data to and from systems.

All the above methods are efficient and solve the data ingestion problem to a great extent. But with the onset of new data sources, e.g. social media feeds, IoT systems, telemetry data, machine data, and never-ending log files, the conventional way of data ingestion became mo...

Real-time Analytics - Implementing a lambda architecture on Hadoop - Part 2

Hbase-Lily Indexer: indexing data from Hbase to Solr by configuration. This is the second part of my three-part blog series on achieving real-time analytic capability. The focus of this blog is indexing data from Hbase to Solr by configuration alone, with very little development. If you have a web or mobile app, it is nice to have search capability on your data; to achieve fuzzy search we use Solr. Since we already loaded the data into Hbase as part of the ETL using Spark, it is not necessary to build another ETL process to load Solr. The Lily Indexer indexes data added, updated, or deleted in Hbase into a Solr collection, syncing the data in near real time. Indexing allows you to query data stored in HBase with the Solr service. The indexer supports flexible, custom, application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase. ...
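As a rough illustration of the configuration-only approach, a Lily HBase Indexer is defined in a small XML file and registered with the hbase-indexer CLI. This is a minimal sketch only; the table name, morphline file, ZooKeeper hosts, and collection name below are hypothetical, not taken from the post.

<?xml version="1.0"?>
<!-- indexer-def.xml: maps HBase rows to Solr documents via a morphline (names are examples) -->
<indexer table="record"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrDocumentMapper">
  <param name="morphlineFile" value="morphlines.conf"/>
</indexer>

# Register the indexer (hypothetical hosts and names)
hbase-indexer add-indexer \
  --name recordIndexer \
  --indexer-conf indexer-def.xml \
  --connection-param solr.zk=zk01:2181/solr \
  --connection-param solr.collection=recordCollection \
  --zookeeper zk01:2181

Once registered, the indexer listens to HBase replication events, so inserts, updates, and deletes flow to Solr in near real time without a separate batch load.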

Real-time Analytics - Implementing a lambda architecture on Hadoop

Implement a lambda architecture with fewer steps, using Spark, Hbase, Solr, the Hbase-Lily indexer, and Hive. Welcome to this three-part tutorial on making your data available for consumption on a (near) real-time basis. The data domain has advanced to the point where decision making has to be rapid, hence we are gradually moving away from batch-based data loads (ETL) and tending towards real-time analytics. With data at the center of strategy and decision making, getting data available sooner is pivotal for every organization. This three-part tech blog explains how to implement a lambda architecture (an architecture supporting batch and real-time analytics alike). The overall architecture for such projects caters to three needs:

1. Quick data access for web sites: random-access data patterns, e.g. looking up a particular profile id, customer id, or comment key (a sketch follows this list).
2. Fast searches on random text: fuzzy search and search suggestions, e.g. customer name, product name, etc.
3. Analytical query support for BI tools like C...
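For need 1, HBase serves single-row lookups by key. The snippet below is a minimal sketch using the standard HBase client API; the table name "profiles", column family "cf", and row key are hypothetical examples, not taken from the series.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// Connect using the cluster configuration on the classpath (hbase-site.xml)
val conf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(conf)
val table = conn.getTable(TableName.valueOf("profiles"))

// Random access by row key, e.g. a particular profile id
val result = table.get(new Get(Bytes.toBytes("profile-123")))
val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))

table.close()
conn.close()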

How to build a data warehouse on Hadoop

Building a data warehouse on the Hadoop platform. A data warehouse on a Big Data platform is a standard use case many organizations are exploring; the reason for this approach is the flexibility a big data platform offers:

- Big Data for enterprise-wide analytics: many organizations are moving to the concept of a data hub or data lake, where data is gathered and stocked by source system rather than in a project-based or subject-area-based warehouse. The advantage of this approach is that marts can be remodeled and redesigned at any time based on requirements.
- Cost: Hadoop has become a cheap alternative storage medium.
- Faster analytics: a Big Data platform (Hadoop is the common moniker for big data) can handle traditional batch workloads as well as near-real-time decision support (2 seconds to 2 minutes from data delivery to action) and near-real-time event processing (100 milliseconds to 2 seconds from data delivery)...
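The landing layer of such a lake is typically built from Hive external tables over HDFS directories, so files stay in place while marts are remodeled on top of them. A minimal HiveQL sketch, with a hypothetical source, schema, and path:

-- External table over raw files landed by source system (names and path are examples)
CREATE EXTERNAL TABLE stg_orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(12,2),
  order_ts    STRING
)
PARTITIONED BY (load_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/lake/sales/orders';

-- Register a day's landed files without moving them
ALTER TABLE stg_orders ADD PARTITION (load_date='2016-01-01')
LOCATION '/data/lake/sales/orders/2016-01-01';

Because the table is EXTERNAL, dropping or remodeling it never deletes the underlying files, which is what makes "remodel the marts anytime" practical.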

Apache Spark - Window functions - Sort, Lead, Lag, Rank, Trend Analysis

Spark Window functions: Sort, Lead, Lag, Rank, Trend Analysis. This tech blog demonstrates how to use functions like withColumn, lead, lag, rank, etc. in Spark. A Spark dataframe is a SQL abstraction layer over Spark's core functionality, which enables users to write SQL on distributed data. Spark SQL supports heterogeneous file formats including JSON, XML, CSV, TSV, etc. In this blog we give a quick overview of how to use Spark SQL and dataframes for common SQL-world use cases. For the sake of simplicity we will deal with a single file in CSV format. The file has four fields: employeeID, employeeName, salary, salaryDate

1,John,1000,01/01/2016
1,John,2000,02/01/2016
1,John,1000,03/01/2016
1,John,2000,04/01/2016
1,John,3000,05/01/2016
1,John,1000,06/01/2016

Save this file as emp.dat. In the first step we will create a Spark dataframe using the spark-csv package from Databricks.

val sqlCont = new HiveContext(sc)
//Define a schema for file
val schema = StructT...
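To show where the post is headed, here is a minimal sketch of lead, lag, and rank over that data, assuming df is the dataframe built from emp.dat and that salaryDate has been parsed into a sortable type; this is illustrative, not the post's exact code.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead, rank}

// Window per employee, ordered by pay date
val byDate = Window.partitionBy("employeeID").orderBy("salaryDate")

val trend = df
  .withColumn("prevSalary", lag(col("salary"), 1).over(byDate))   // previous month's pay
  .withColumn("nextSalary", lead(col("salary"), 1).over(byDate))  // next month's pay

// Rank each employee's salaries from highest to lowest
val bySalary = Window.partitionBy("employeeID").orderBy(col("salary").desc)
val ranked = trend.withColumn("salaryRank", rank().over(bySalary))

Comparing salary with prevSalary per row is what yields the up/down/flat trend analysis mentioned in the title.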

Update and Delete on Hive table (Hive supports CRUD)

Hive is an Apache project which gives the ability to have a relational database structure on the Hadoop platform. Hive uses Hadoop core components: HDFS for data storage and MapReduce to execute jobs. Since the big data platform is maturing beyond MapReduce, the Hive project has also started adopting the new technologies emerging on the platform; the execution engine in Hive can be changed to a more advanced one like Tez or Spark. Hive's ability to form a relational-database-like structure on files owes to its metastore. The Hive metastore, usually a relational database like MySQL or Derby, stores metadata about the files. File metadata includes file structure, folder structure, partitions, buckets, and serialization/deserialization options for file reads and writes. Using Hive for ETL is a very common practice these days. Hive's relational database capability helps migrate data warehouses or data stores built in relational databases like Oracle, Teradata, Netez...
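As a minimal sketch of the CRUD capability the title refers to: since Hive 0.14, UPDATE and DELETE work on ACID tables, which must be stored as ORC, bucketed, and marked transactional (and the ACID transaction manager must be enabled in the Hive configuration). Table and column names below are hypothetical.

-- ACID requires an ORC, bucketed, transactional table
CREATE TABLE emp_acid (
  employeeID   INT,
  employeeName STRING,
  salary       DECIMAL(10,2)
)
CLUSTERED BY (employeeID) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level changes, written as delta files and compacted in the background
UPDATE emp_acid SET salary = 2500 WHERE employeeID = 1;
DELETE FROM emp_acid WHERE employeeID = 2;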