A tale of stream and batch data ingestion
Data ingestion is an inevitable part of analytics, whether it is saving data entered by users on a website or moving large volumes of data from machines and sensors. This blog talks about the evolution of and trends in data ingestion, and the modern architectural principles for gaining insights from the ingested data.
With the preface set, let's look at how data has been ingested since the inception of modern data warehousing and analytics:
- Good old file transfer — Files of all kinds, from transaction systems and elsewhere, were transferred using a file transfer protocol
- Database entry — Data is written directly into databases
- Queue mechanism — A more reliable message queue is used to transfer data to and from systems
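To make the queue idea concrete, here is a minimal sketch in plain Python. It uses the standard library `queue` module as a stand-in for a real message broker (Kafka, RabbitMQ, or a cloud queue service); the producer and consumer names are illustrative, not from any particular product:

```python
import queue
import threading

# Stand-in for a durable message broker; a real system would use
# Kafka, RabbitMQ, or a managed cloud queue instead.
ingest_queue = queue.Queue()

def producer(records):
    """The source system pushes records onto the queue."""
    for record in records:
        ingest_queue.put(record)
    ingest_queue.put(None)  # sentinel: no more data

def consumer(sink):
    """The ingestion side pulls records and writes them to the sink."""
    while True:
        record = ingest_queue.get()
        if record is None:
            break
        sink.append(record)

warehouse = []
t = threading.Thread(target=consumer, args=(warehouse,))
t.start()
producer([{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}])
t.join()
print(warehouse)  # both records arrive, decoupled from the producer
```

The point of the queue is decoupling: the producer does not wait for the consumer, and a durable broker lets either side fail and recover without losing data.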
All the above methods are efficient and solve the data ingestion problem to a great extent. But with the onset of new data sources (social media feeds, IoT systems, telemetry data, machine data, never-ending log files, etc.), the conventional ways of ingesting data became more challenging.
All of a sudden, enterprises are looking to gain a competitive advantage by meshing the new data with the legacy data they already have. Classic examples include the retail industry using social media feeds to upsell and cross-sell products, and the insurance industry using telemetry data to rate drivers.
Now, if we look at data ingestion patterns, we can broadly classify them into two:
- Fast-moving data or stream of data — Data emitted by sensors, social media feeds, etc. The data can be a continuous flow of information or arrive in a trickle. The volume and speed of the data are very unpredictable here, e.g. user interactions on a website at peak hours.
- Static data or slow-moving data — Data files received from a third-party vendor once a day, or summarized transactions from a system of record. Here, the data volume, the velocity of data movement, etc. are more predictable.
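The two patterns above boil down to how data arrives: continuously and unpredictably, or on a predictable schedule. A toy Python sketch of this classification (the source names and the `arrival` attribute are hypothetical, purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    arrival: str  # "continuous" (sensors, feeds) or "scheduled" (daily files)

def ingestion_pattern(source: Source) -> str:
    """Classify a data source into one of the two broad ingestion patterns."""
    # Continuous, unpredictable arrivals call for stream ingestion;
    # scheduled, predictable arrivals fit batch ingestion.
    return "stream" if source.arrival == "continuous" else "batch"

sources = [
    Source("website-clicks", "continuous"),
    Source("vendor-daily-file", "scheduled"),
]
for s in sources:
    print(s.name, "->", ingestion_pattern(s))
```

Real systems make this call per source too: the choice of pipeline, tooling, and capacity planning all follow from whether arrivals are predictable.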
For a long time, data engineers had trouble dealing with this variety of data coming into the analytics system. But with the advent of "big data" technologies, we are in a better position to deal with these data patterns. Data engineers started to implement the Lambda architecture using a variety of technologies from the big data ecosystem. Basically, the Lambda architecture processes fast-moving stream data and static batch data in two separate ELT (extract, load, and transform) pipelines that are finally merged into a single analytics zone.
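The shape of the Lambda architecture can be sketched in a few lines of plain Python. This is a toy model under stated assumptions: in a real deployment the batch layer might be Hive on Hadoop and the speed layer a stream processor, and the function names here are illustrative, not any framework's API:

```python
from collections import Counter

def batch_layer(master_dataset):
    """Recompute a complete view from the full historical dataset."""
    view = Counter()
    for event in master_dataset:
        view[event["user"]] += event["amount"]
    return view

def speed_layer(recent_events):
    """Incrementally compute a view over events not yet in the batch view."""
    view = Counter()
    for event in recent_events:
        view[event["user"]] += event["amount"]
    return view

def serving_layer(batch_view, realtime_view):
    """Merge the two views into the single analytic zone queried by users."""
    return batch_view + realtime_view  # Counter addition sums per key

history = [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]
fresh = [{"user": "a", "amount": 3}]

merged = serving_layer(batch_layer(history), speed_layer(fresh))
print(merged["a"])  # 13: the batch total plus the latest stream update
```

The design choice this illustrates is the one the post describes: two separate pipelines, one slow and complete and one fast and incremental, reconciled only at query time.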
Advantages
- No need to wait for daily loads or weekly loads to complete — Information is readily available to consume.
- Ability to mesh internal data with third-party data to give a more holistic view of the business.
Disadvantages
- Need a variety of technologies and skill sets to handle this — Right now, each technology in the big data ecosystem is good at one aspect of analytics. e.g. Apache Hive is the de facto database for analytic queries, whereas Apache HBase is good at random reads and writes, and Apache Solr is perfect for text search and text analytics.
- ETL processing is not yet mature enough to handle two different types of workloads.
- Lack of flexibility to balance ETL/ELT loads on the same system. e.g. A system tuned for analytic queries cannot efficiently handle stream processing, and vice versa.
Let us know if you want to know more about the implementation of the Lambda architecture and handling fast-moving data and data warehouse workloads on Hadoop. Read my blog on implementing the Lambda architecture here.
