Posts

Showing posts from 2016

How to build a data warehouse on Hadoop

Building a data warehouse on the Hadoop platform. A data warehouse on a big data platform is a standard use case many organizations are exploring, and the reasons come down to the flexibility a big data platform offers. Big Data for enterprise-wide analytics - Many organizations are moving to the concept of a data hub or data lake, where data is gathered and stored by source system rather than in a project-based or subject-area-based data warehouse. The advantage of this approach is that marts can be remodeled and redesigned at any time as requirements change. Cost - Hadoop has become a cheap alternative storage medium. Faster analytics - A big data platform (Hadoop is the common moniker for big data) can handle traditional batch workloads as well as near-real-time decision support (2 seconds to 2 minutes is the time taken to act since data delivery) or near-real-time event processing (100 milliseconds to 2 seconds is the time taken to act since data d...

Apache Spark -Window functions - Sort, Lead, Lag , Rank , Trend Analysis

Spark Window functions - Sort, Lead, Lag, Rank, Trend Analysis This tech blog demonstrates how to use functions like withColumn, lead, lag, rank, etc. with Spark. A Spark DataFrame is an SQL abstraction layer over Spark core functionality, which enables users to write SQL on distributed data. Spark SQL supports heterogeneous file formats including JSON, XML, CSV, and TSV. This blog gives a quick overview of how to use Spark SQL and DataFrames for common use cases from the SQL world. For the sake of simplicity we will deal with a single file in CSV format. The file has four fields: employeeID, employeeName, salary, salaryDate

1,John,1000,01/01/2016
1,John,2000,02/01/2016
1,John,1000,03/01/2016
1,John,2000,04/01/2016
1,John,3000,05/01/2016
1,John,1000,06/01/2016

Save this file as emp.dat. In the first step we will create a Spark DataFrame using the spark-csv package from Databricks. val sqlCont = new HiveContext(sc) //Define a schema for file val schema = StructT...
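The excerpt's Spark code needs a running cluster, but what lead, lag, and rank actually compute over the six sample rows can be sketched with plain window-function SQL. The following is a minimal sketch using Python's bundled sqlite3 module (window functions require SQLite 3.25 or later); the dates are rewritten as ISO strings so they sort correctly, and the column aliases are illustrative. In Spark the equivalent is `lag`/`lead`/`rank` over a `Window` specification partitioned by employeeID and ordered by salaryDate.

```python
# Sketch of LAG/LEAD/RANK semantics over the post's sample employee data,
# using Python's bundled sqlite3 (needs SQLite >= 3.25 for window functions).
# Dates are rewritten as ISO strings (2016-01-01, ...) so text ordering works.
import sqlite3

rows = [
    (1, "John", 1000, "2016-01-01"),
    (1, "John", 2000, "2016-02-01"),
    (1, "John", 1000, "2016-03-01"),
    (1, "John", 2000, "2016-04-01"),
    (1, "John", 3000, "2016-05-01"),
    (1, "John", 1000, "2016-06-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (employeeID INT, employeeName TEXT, salary INT, salaryDate TEXT)"
)
conn.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)", rows)

# LAG = previous month's salary, LEAD = next month's (for trend analysis);
# RANK orders salaries highest-first within each employee.
query = """
SELECT salaryDate,
       salary,
       LAG(salary)  OVER w AS prev_salary,
       LEAD(salary) OVER w AS next_salary,
       RANK() OVER (PARTITION BY employeeID ORDER BY salary DESC) AS salary_rank
FROM emp
WINDOW w AS (PARTITION BY employeeID ORDER BY salaryDate)
ORDER BY salaryDate
"""
result = list(conn.execute(query))
for r in result:
    print(r)
```

The first row has no previous month, so its prev_salary is NULL; the 3000 salary in May ranks first. Comparing salary against prev_salary is the basis of the trend analysis the post's title refers to.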

Update and Delete on Hive tables (Hive supports CRUD)

Hive is an Apache project which gives the ability to impose a relational database structure on the Hadoop platform. Hive uses Hadoop core components: HDFS for data storage and MapReduce to execute jobs. As the big data platform matures beyond MapReduce, the Hive project has also started adopting new technologies emerging on the platform; the execution engine in Hive can be switched to a more advanced one such as Tez or Spark. Hive's ability to form a relational database-like structure on files owes to its metastore. The Hive metastore is usually a relational database, such as MySQL or Derby, which stores metadata about the files. File metadata includes file structure, folder structure, partitions, buckets, and serialization/deserialization options for file reads and writes. Using Hive for ETL is a very common practice these days. Hive's relational database capability helps migrate data warehouses or data stores built in relational databases like Oracle, Teradata, Netez...
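The excerpt is cut off before the mechanics, so here is a hedged sketch of the prerequisites: UPDATE and DELETE (available since Hive 0.14) only work on transactional tables, which must be bucketed, stored as ORC, and marked transactional, with the session using the DbTxnManager. The table and column names below are illustrative, not from the post.

```sql
-- Hedged sketch: Hive 0.14+ ACID prerequisites (table/column names illustrative).
-- Session settings required for transactional DML:
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- UPDATE/DELETE work only on bucketed ORC tables marked transactional:
CREATE TABLE emp_txn (
  employeeID INT,
  employeeName STRING,
  salary INT
)
CLUSTERED BY (employeeID) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE emp_txn SET salary = 2500 WHERE employeeID = 1;
DELETE FROM emp_txn WHERE salary < 1000;
```

Note that each UPDATE/DELETE writes delta files that background compaction later merges, so this path suits slowly changing dimensions rather than high-frequency OLTP workloads.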

Spark Getting started - Develop using eclipse locally

This article will help you jump-start Spark development on your PC or laptop (Windows) without having a fully functional Hadoop cluster installed. I use an 8 GB RAM, 128 GB storage, Windows 10 machine. These days I try to isolate development environments using Docker containers or Bluemix containers, but sometimes I still fall back to developing on my local machine before deploying the code to a cluster. This blog covers setting up Spark and Eclipse as an IDE for local development with bare-minimum prerequisites. As I write this, Spark 1.5.1 is the current release and I am using it. Follow the instructions below to set up Spark on your machine. Hadoop installation on Windows: 1. Assuming your OS is Windows, download and install Hadoop on Windows. This need not be a fully functional Hadoop cluster; we only need some libraries that Spark will use later. Download Hadoop-2.6.0.tar.gz. 2. You don't need to install , all you...
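The excerpt stops before the project setup, so as a hedged sketch of the build side: a minimal sbt configuration for local Spark 1.5.1 development might look like the fragment below. The version numbers mirror the post (Spark 1.5.1 built against Scala 2.10); the project name and everything else are assumptions.

```scala
// Minimal build.sbt for local Spark 1.5.1 development (illustrative sketch).
// Spark 1.5.x artifacts were published against Scala 2.10.
name := "spark-local-dev"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.1",
  "org.apache.spark" %% "spark-sql"  % "1.5.1",
  "org.apache.spark" %% "spark-hive" % "1.5.1"
)
```

On Windows you would additionally point HADOOP_HOME at the unpacked hadoop-2.6.0 directory mentioned in the post, since Spark looks up Hadoop's native Windows binaries there at startup.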

IBM BigInsights - Bigsheets (excel like interface to HDFS files and tables)

BigSheets is a browser-based tool, included in the BigInsights data scientist and data analyst packages, for analyzing and visualizing big data. BigSheets uses a spreadsheet-like interface that can model, filter, combine, and chart data collected from multiple sources, such as applications running on the big data environment. Since BigSheets is a service running on the big data cluster, users do not need to worry about connectivity; it is installed on the cluster just like other services (Hive, HBase, etc.). In this demo we will see how to:

Create a master workbook from an existing file in HDFS
Tailor data by creating a child workbook
Create columns after grouping data
Create quick charts
Export data to other formats

Accessing BigSheets: BigSheets is available on the application tab in IBM® InfoSphere® BigInsights™ Enterprise Edition. Click the BigSheets tab and launch the application. BigSheets works on Ha...