Search for actionable intelligence

Posts

Showing posts from May, 2016

Apache Spark - aggregate functions explained (reduceByKey, groupByKey and combineByKey)

May 24, 2016

An easy explanation of the differences between Spark's aggregate functions (reduceByKey, groupByKey and combineByKey). Spark comes with many easy-to-use aggregate functions out of the box, which is one reason it is a powerful technology for ETL (Extract, Transform and Load) on big data. Grouping data is a very common ETL use case, much like the aggregate transformation in ETL tools such as Ab Initio or Informatica, where results are grouped and aggregate functions are applied. Examples: group all customer orders by customer key, find the best sales year, or find the worst baseball player by strike rate. Unlike standard ETL tools, Spark comes with three transformations that achieve the same result in different ways: reduceByKey, groupByKey and combineByKey. These three transformations can be used interchangeably in the sense that each can produce the same grouped result. Before getting to further details, it is important to understand all this ...
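To make the "same result, different ways" idea concrete, here is a minimal plain-Python sketch. It does not use the Spark API: the helper names (`group_by_key`, `reduce_by_key`, `combine_by_key`), the sample orders and the two-way partition split are all assumptions made for illustration, and the helpers only imitate the semantics of the three transformations on small in-memory lists.

```python
from collections import defaultdict

# Hypothetical sample data: (customer_key, order_amount) pairs.
orders = [("c1", 10), ("c2", 5), ("c1", 7), ("c2", 1), ("c3", 4)]
# Assumed split into two "partitions" to imitate distributed data.
partitions = [orders[:3], orders[3:]]


def group_by_key(pairs):
    """Collect every value of a key into a list, in the spirit of groupByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, func):
    """Fold the values of each key with a binary function, in the spirit of reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def combine_by_key(parts, create_combiner, merge_value, merge_combiners):
    """Build per-partition accumulators, then merge them across partitions."""
    merged = {}
    for part in parts:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        for key, acc in local.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged


# Total order amount per customer, computed three ways.
totals_group = {k: sum(v) for k, v in group_by_key(orders).items()}
totals_reduce = reduce_by_key(orders, lambda a, b: a + b)

# The accumulator is a (sum, count) tuple, a different type from the int values.
sum_count = combine_by_key(
    partitions,
    create_combiner=lambda v: (v, 1),
    merge_value=lambda acc, v: (acc[0] + v, acc[1] + 1),
    merge_combiners=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
totals_combine = {k: s for k, (s, _) in sum_count.items()}

print(totals_group, totals_reduce, totals_combine)
```

All three dictionaries hold the same totals. The difference lies in the intermediate state: the grouping version keeps every value per key before summing, the reducing version keeps one running value per key, and the combining version keeps an accumulator whose type can differ from the value type (here a sum and a count, enough to derive an average as well).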

May 22, 2016

How to insert data into a remote Hive server from Spark

Spark is the buzzword in the big data world right now. So what makes Spark so unique?

- Spark is fast: it uses in-memory computation on special data objects called RDDs (Resilient Distributed Datasets).
- Spark can run in multiple modes: standalone, local (without even a Hadoop server), or on a cluster through resource managers (Mesos, YARN).
- Spark takes care of data lineage and fault recovery, using a DAG (Directed Acyclic Graph) as the blueprint for execution, which can be rebuilt at any point in case of failure.
- Easy APIs: the APIs are easy to use.
- Read from anywhere: data can be read from different types of sources, i.e. files, JSON, databases etc., e.g. CassandraAPI.
- Write to anywhere: result data can be saved to any format.
- Multiple language support: Spark supports Scala, Java and Python.

For people from a database and SQL background, Spark is simplified to run SQL on RDDs (called dataframes) through SparkSQL (known as shark e...