Posts

Showing posts from August, 2016

How to build a data warehouse on Hadoop

Building a data warehouse on the Hadoop platform is a standard use case that many organizations are exploring. The reasons for this approach come down to the flexibility a big data platform offers:

Big Data for enterprise-wide analytics - Many organizations are moving toward a data hub or data lake, where data is gathered and stored by source system rather than in project-based or subject-area-based data warehouses. The advantage of this approach is that marts can be remodeled and redesigned at any time as requirements change.

Cost - Hadoop has become a cheap alternative storage medium.

Faster analytics - A big data platform (Hadoop is the common moniker for big data) can handle traditional batch workloads as well as near-real-time decision support (2 seconds to 2 minutes from data delivery to action) and near-real-time event processing (100 milliseconds to 2 seconds from data d...

Apache Spark - Window functions - Sort, Lead, Lag, Rank, Trend Analysis

This tech blog demonstrates how to use functions like withColumn, lead, lag, rank, etc. with Spark. A Spark DataFrame is a SQL abstraction layer over Spark core functionality, which enables users to write SQL on distributed data. Spark SQL supports heterogeneous file formats including JSON, XML, CSV, TSV, etc. This blog gives a quick overview of how to use Spark SQL and DataFrames for common use cases from the SQL world. For the sake of simplicity we will deal with a single file in CSV format. The file has four fields: employeeID, employeeName, salary, salaryDate

1,John,1000,01/01/2016
1,John,2000,02/01/2016
1,John,1000,03/01/2016
1,John,2000,04/01/2016
1,John,3000,05/01/2016
1,John,1000,06/01/2016

Save this file as emp.dat. In the first step we will create a Spark DataFrame using the spark-csv package from Databricks.

val sqlCont = new HiveContext(sc)
//Define a schema for file
val schema = StructT...
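The snippet above is truncated in this excerpt. Below is a minimal sketch of how the complete example might look, assuming Spark 1.x with the Databricks spark-csv package; variable names beyond those visible in the excerpt (df, w, result) are illustrative, not taken from the original post.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val sqlCont = new HiveContext(sc)

// Schema for emp.dat: employeeID, employeeName, salary, salaryDate
val schema = StructType(Array(
  StructField("employeeID", IntegerType, true),
  StructField("employeeName", StringType, true),
  StructField("salary", IntegerType, true),
  StructField("salaryDate", StringType, true)))

// Read the CSV file with the spark-csv package
val df = sqlCont.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("emp.dat")

// Window per employee, ordered by salary date (kept as a string here;
// real data would be parsed to a date type before ordering)
val w = Window.partitionBy("employeeID").orderBy("salaryDate")

val result = df
  .withColumn("prevSalary", lag("salary", 1).over(w))   // previous row's salary
  .withColumn("nextSalary", lead("salary", 1).over(w))  // next row's salary
  .withColumn("salaryRank",
    rank().over(Window.partitionBy("employeeID").orderBy(desc("salary"))))
  .withColumn("trend",                                  // simple trend flag
    when(col("salary") > lag("salary", 1).over(w), "UP")
      .when(col("salary") < lag("salary", 1).over(w), "DOWN")
      .otherwise("SAME"))

result.show()

Note that in Spark 1.x, window functions require a HiveContext, which is why the excerpt creates one instead of a plain SQLContext.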

Update and Delete on a Hive table (Hive supports CRUD)

Hive is an Apache project which gives the ability to have a relational database structure on the Hadoop platform. Hive uses Hadoop core components like HDFS for data storage and MapReduce to execute jobs. As the big data platform matures beyond MapReduce, the Hive project has also started adopting new technologies emerging on the platform. The execution engine in Hive can be changed to a more advanced technology like Tez or Spark. Hive's ability to form a relational-database-like structure on files owes to its metastore. The Hive metastore is usually a relational database, like MySQL or Derby, which stores metadata about the files. File metadata includes file structure, folder structure, partitions, buckets, and serialization and deserialization options for file reads and writes. Usage of Hive for ETL is a very common practice these days. Hive's relational database capability helps migrate data warehouses or data stores built on relational databases like Oracle, Teradata, Netez...
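The excerpt is cut off here, but since the post is about UPDATE and DELETE, a minimal sketch of Hive's ACID DML follows, assuming Hive 1.x or later with ORC storage; the table name and column names are illustrative, not from the original post.

-- Hive ACID DML needs a transactional, bucketed ORC table and the
-- DbTxnManager transaction manager enabled for the session.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

CREATE TABLE emp_salary (
  employeeID INT,
  employeeName STRING,
  salary INT)
CLUSTERED BY (employeeID) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- With the table declared transactional, updates and deletes
-- work much like in a traditional relational database.
UPDATE emp_salary SET salary = 2500 WHERE employeeID = 1;
DELETE FROM emp_salary WHERE salary < 1000;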