Hive Table usage in Hadoop
In the data engineering world, we often have to make schema changes to an existing table, or rebuild an existing table's data, for reasons such as bugs that fell through the cracks or new business requirements that make a data rebuild necessary.
The very first thing we need to understand in such scenarios is the impact: who has been using that data over the past 6 months, or even longer, depending on the criticality of the data set.
To answer that question, the following code snippet is very useful for identifying the impacted users:
<<
Step 1:
hadoop jar /apache/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.2.4.2.66-4.jar \
-Dstream.non.zero.exit.is.failure=false \
-Dmapred.job.queue.name=<QUEUE Name> \
-Dmapred.job.name="grepper" \
-Dmapred.reduce.tasks=1 \
-input /logs/<HADOOP NAMENODE>/auditlog/YYYY-* \
-output <a HDFS location where your account has write access>/ \
-mapper 'zgrep <table name that you are looking to identify usage>'
Step 2:
hadoop fs -text <a HDFS location where your account has write access>/part-00000.gz | awk '{print $6}' | sort | uniq
>>
The snippet above uses the streaming jar to parse the audit logs for a specific date range and writes only the matching log records to an output file. We can then review the resulting list of users and send a notification to inform them of the impact.
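To see why `awk '{print $6}'` in Step 2 surfaces the user names, here is a quick local simulation on a couple of made-up audit log lines. Note that the exact field layout of the HDFS audit log varies by Hadoop version, so the sample format below is an assumption; verify the position of the `ugi=` token in your own logs before relying on column 6.

```shell
# Hypothetical HDFS audit log lines -- the real field layout depends on your
# Hadoop version and log4j pattern, so treat this sample format as an assumption.
cat > /tmp/audit_sample.log <<'EOF'
2021-03-01 10:15:22,101 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:KERBEROS) ip=/10.1.2.3 cmd=open src=/warehouse/db/my_table/part-00000 dst=null perm=null
2021-03-01 11:02:45,330 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:KERBEROS) ip=/10.1.2.4 cmd=getfileinfo src=/warehouse/db/my_table dst=null perm=null
2021-03-02 09:41:03,512 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:KERBEROS) ip=/10.1.2.3 cmd=open src=/warehouse/db/my_table/part-00001 dst=null perm=null
EOF

# In this layout, field 6 is the ugi=<user> token; sort | uniq de-duplicates
# so each user appears once, however many times they touched the table.
awk '{print $6}' /tmp/audit_sample.log | sort | uniq
# prints:
# ugi=alice
# ugi=bob
```

The same pipeline is what Step 2 runs against the `part-00000.gz` output of the streaming job, just with `hadoop fs -text` decompressing and feeding the records in.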