Hive Table Usage in Hadoop

In the data engineering world, we often need to make schema changes to an existing table or rebuild its data, for reasons ranging from bugs that fell through the cracks to new business requirements that make a rebuild necessary.

The very first thing we need to understand in such a scenario is the impact: who has been using that data over the past six months, or even longer, depending on the criticality of the data set?

To answer that question, the following snippet is very useful for identifying the impacted users:

<<
Step 1:

hadoop jar /apache/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.2.4.2.66-4.jar \
-Dstream.non.zero.exit.is.failure=false \
-Dmapred.job.queue.name=<QUEUE Name> \
-Dmapred.job.name="grepper" \
-Dmapred.reduce.tasks=1 \
-input /logs/<HADOOP NAMENODE>/auditlog/YYYY-* \
-output <a HDFS location where your account has write access>/ \
-mapper 'zgrep <table name that you are looking to identify usage>'


Step 2:

hadoop fs -text <a HDFS location where your account has write access>/part-00000.gz | awk '{print $6}' | sort | uniq

>>
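Because streaming mappers simply read records on stdin, the grep step can be dry-run locally before submitting the job. A minimal sketch, using a hypothetical table name `db.sales_orders` and made-up audit lines (the real audit log layout varies by Hadoop version):

```shell
# Simulate two audit records on stdin; only the line touching the
# hypothetical table db.sales_orders should survive the grep.
printf '%s\n' \
  '2021-03-01 10:00:01 INFO FSNamesystem.audit: allowed=true ugi=alice cmd=open src=/warehouse/db.sales_orders/part-0 dst=null' \
  '2021-03-01 10:00:02 INFO FSNamesystem.audit: allowed=true ugi=bob cmd=open src=/warehouse/clicks/part-0 dst=null' \
  | grep 'db.sales_orders'
```

Here plain `grep` stands in for the `zgrep` mapper, since the local sample is uncompressed text.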

The snippet above uses the streaming jar to scan the audit logs matching a date glob, writing only the matching log lines to the output location; Step 2 then extracts the user field from those lines and dedupes it.
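The `awk '{print $6}'` in Step 2 pulls out the `ugi=` (user) field from each matching audit line. A sketch with a made-up audit line; field positions can differ across Hadoop versions, so verify that `$6` is the user field in your own logs:

```shell
# Hypothetical audit line; whitespace-split fields put ugi= in column 6:
# $1=date $2=time $3=INFO $4=FSNamesystem.audit: $5=allowed=true $6=ugi=alice
line='2021-03-01 10:00:01 INFO FSNamesystem.audit: allowed=true ugi=alice cmd=open src=/warehouse/db.sales_orders dst=null'
echo "$line" | awk '{print $6}'   # -> ugi=alice
```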

We can then review the resulting user list and send an eNotification as needed to inform those users of the impact.



