Posts

Showing posts from February, 2020

Hive Table usage in Hadoop

In the data-engineering world, we often have to make schema changes to an existing table or rebuild an existing table's data, for reasons such as bugs that fell through the cracks, or new business requirements that make you realize the data needs to be rebuilt to rectify it, and so on. The very first thing to understand in such scenarios is the impact: who has been using that data in the past six months, or even longer, depending on the criticality of the data set. To answer such questions, the following code snippet is very useful for identifying the impacted users:

Step 1:

    hadoop jar /apache/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.2.4.2.66-4.jar \
      -Dstream.non.zero.exit.is.failure=false \
      -Dmapred.job.queue.name=<QUEUE Name> \
      -Dmapred.job.name="grepper" \
      -Dmapred.reduce.tasks=1 \
      -input /logs/<HADOOP NAMENODE>/auditlog/YYYY-* \
      -output <a HDFS location where your account has write access>/ \
      -mapper ...
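The streaming job above essentially greps the NameNode audit logs for accesses to the table's HDFS path and collects the users who touched it. A minimal sketch of that mapper logic in Python, assuming the usual HDFS audit-log fields (`ugi=` for the user, `src=` for the path); the sample lines and table path are illustrative, not real log output:

```python
import re

# Typical HDFS audit-log fields we care about: ugi= (user), cmd= (operation),
# and src= (the path accessed). Field layout is an assumption based on the
# standard FSNamesystem audit format; adjust the regex to your cluster's logs.
AUDIT_RE = re.compile(r"ugi=(\S+).*?cmd=(\S+)\s+src=(\S+)")

def users_accessing(lines, table_path):
    """Return the set of users whose audit entries touch table_path."""
    users = set()
    for line in lines:
        m = AUDIT_RE.search(line)
        if m and m.group(3).startswith(table_path):
            users.add(m.group(1))
    return users

# Illustrative sample lines (hypothetical users and paths).
sample = [
    "2020-02-01 10:00:01 INFO FSNamesystem.audit: allowed=true ugi=alice "
    "(auth:KERBEROS) cmd=open src=/warehouse/db/orders/part-0 dst=null",
    "2020-02-01 10:05:12 INFO FSNamesystem.audit: allowed=true ugi=bob "
    "(auth:KERBEROS) cmd=listStatus src=/warehouse/db/clicks dst=null",
]
print(sorted(users_accessing(sample, "/warehouse/db/orders")))  # ['alice']
```

In the real job, this filtering runs as the `-mapper` across all audit-log splits, and the single reducer collects the distinct users.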

Spark SQL - Performance Tuning tips

In this article I am sharing my understanding of and experience with Spark SQL, since we just completed migrating 300+ tables in my domain from Teradata-based ETL processes to Spark SQL processes using HDFS as the backend data storage cluster. How can we maintain the distribution of data the way Teradata does with a Primary Index? Spark supports this through the "distribute by" clause. As the name suggests, it distributes the rows across partitions based on the columns (or valid expressions) you pass, spreading the data evenly. Of course, if the chosen columns or expressions are themselves not balanced, we will again end up with an uneven distribution of data, which you may also have experienced in Teradata by choosing an inappropriate primary index column. Why do we have to distribute the data evenly across all partitions? The answer is simple. If your team has four engineers and your scrum lead assigns the workload unevenly, will that do any good? ...
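The effect of "distribute by" can be illustrated outside Spark: rows are routed to partitions by hashing the chosen key, so a high-cardinality, balanced key spreads rows evenly while a skewed key overloads one partition. A small Python simulation of that routing; the hash function and partition count here are stand-ins, not Spark's actual hashing:

```python
from collections import Counter
import zlib

def partition_counts(keys, num_partitions):
    """Route each key to a partition by hashing, mimicking 'distribute by'."""
    # zlib.crc32 is a stand-in for Spark's real hash; the principle is the same:
    # equal keys land in the same partition, distinct keys spread out.
    counts = Counter(zlib.crc32(str(k).encode()) % num_partitions
                     for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Balanced key (unique ids) vs skewed key (one hot value repeated 700 times).
balanced = partition_counts(range(1000), 4)
skewed = partition_counts([42] * 700 + list(range(300)), 4)
print("balanced:", balanced)  # roughly 250 rows per partition
print("skewed:  ", skewed)    # one partition carries at least the 700 hot rows
```

This is exactly the scrum-lead analogy in code: the "engineer" (partition) who gets the hot key ends up with most of the work, and the job waits on that straggler.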