Hive Table Usage in Hadoop

In the data engineering world, we often need to make schema changes to an existing table or rebuild its data, for reasons ranging from bugs that fell through the cracks to new business requirements that make a rebuild necessary.

The very first thing we need to understand in such a scenario is the impact: who has been using that data over the past six months, or even longer, depending on the criticality of the data set?

To answer that question, the following snippet is very useful for identifying the impacted users:

<<
Step 1:

hadoop jar /apache/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.2.4.2.66-4.jar \
-Dstream.non.zero.exit.is.failure=false \
-Dmapred.job.queue.name=<QUEUE Name> \
-Dmapred.job.name="grepper" \
-Dmapred.reduce.tasks=1 \
-input /logs/<HADOOP NAMENODE>/auditlog/YYYY-* \
-output <a HDFS location where your account has write access>/ \
-mapper 'zgrep <table name that you are looking to identify usage>'


Step 2:

hadoop fs -text <a HDFS location where your account has write access>/part-00000.gz | awk '{print $6}' | sort | uniq

>>
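Because streaming mappers simply read records on stdin, the grep step can be dry-run locally before submitting the job. A minimal sketch, using a hypothetical table name `db.sales_orders` and made-up audit lines (the real audit log layout varies by Hadoop version):

```shell
# Simulate two audit records on stdin; only the line touching the
# hypothetical table db.sales_orders should survive the grep.
printf '%s\n' \
  '2021-03-01 10:00:01 INFO FSNamesystem.audit: allowed=true ugi=alice cmd=open src=/warehouse/db.sales_orders/part-0 dst=null' \
  '2021-03-01 10:00:02 INFO FSNamesystem.audit: allowed=true ugi=bob cmd=open src=/warehouse/clicks/part-0 dst=null' \
  | grep 'db.sales_orders'
```

Here plain `grep` stands in for the `zgrep` mapper, since the local sample is uncompressed text.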

The snippet above uses the streaming jar to scan the audit logs matching a date glob, writing only the matching log lines to the output location; Step 2 then extracts the user field from those lines and dedupes it.
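The `awk '{print $6}'` in Step 2 pulls out the `ugi=` (user) field from each matching audit line. A sketch with a made-up audit line; field positions can differ across Hadoop versions, so verify that `$6` is the user field in your own logs:

```shell
# Hypothetical audit line; whitespace-split fields put ugi= in column 6:
# $1=date $2=time $3=INFO $4=FSNamesystem.audit: $5=allowed=true $6=ugi=alice
line='2021-03-01 10:00:01 INFO FSNamesystem.audit: allowed=true ugi=alice cmd=open src=/warehouse/db.sales_orders dst=null'
echo "$line" | awk '{print $6}'   # -> ugi=alice
```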

We can then review the resulting user list and send an eNotification as needed to inform those users of the impact.



