Posts

Parsing a long text and fetching desired pattern strings using SPARK SQL

One of my co-workers from the analyst group approached me for help with a problem she was facing: extracting all matching words/substrings within a long text column using Spark SQL. The long text is basically an email body that a seller sends out to his/her customers, with many URLs such as their store webpage, Facebook page, etc. So the first thing she tried, naturally, was the regexp_extract built-in function:

SET spark.sql.parser.escapedStringLiterals=true;

select REGEXP_EXTRACT(
  lower('Thanks for your purchase. Leave your feedback "http //feedback.abc.com". Like us on facebook "https //www.facebook.com/abcPets". Visit our site here "www.abcpets.com" and save 7 of your next purchase, use code save7 at checkout. Call us 123 456 7890.'),
  '((http //|https //|http://|https://|(www.))\w+\.\w+)', 0) AS URL_PATTERN_1;

But the result was only the first matching string, as shown below:

URL_PATTERN_1
-------------
http...
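The behavior the post is after (every match, not just the first) is what Spark's regexp_extract_all provides in Spark 3.1+. A minimal sketch of that behavior using Python's re.findall as a stand-in, with the email body and a slightly tidied version of the post's pattern (the unescaped dot after www is escaped here):

```python
import re

# Sample email body from the post, lowercased as the query does.
body = ('thanks for your purchase. leave your feedback "http //feedback.abc.com". '
        'like us on facebook "https //www.facebook.com/abcpets". '
        'visit our site here "www.abcpets.com" and save 7 of your next purchase, '
        'use code save7 at checkout. call us 123 456 7890.')

# Non-capturing version of the post's pattern; findall then returns whole matches.
pattern = r'(?:http //|https //|http://|https://|www\.)\w+\.\w+'

# findall returns every non-overlapping match, not just the first one.
urls = re.findall(pattern, body)
print(urls)
```

In Spark SQL itself the equivalent call would be regexp_extract_all(col, pattern, 0), which returns an array of all matches.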

Teradata - Useful SQLs

1. How to create a copy of a table from an existing table, including its data, in one line?

create multiset table <NewTableDB>.<NewTableName> as <ExistingTableDB>.<ExistingTableName> with data;

2. Sometimes we may need to review table-level stats (who created the table and when, when it was last accessed, etc.) for various reasons. In such scenarios you can use the following SQL:

-- Table space in GB along with last access time in a specific database
SELECT
  a.databasename,
  a.tablename,
  t.creatorname,
  t.CreateTimestamp,
  t.LastAccessTimestamp,
  SUM(a.currentperm)/1024**3 (FORMAT 'zz9.99') AS current_GB
FROM dbc.allspace a
JOIN dbc.tablesv t
  ON a.databasename = t.databasename
  AND a.tablename = t.tablename
WHERE
  -- tablename = '<tblname>' AND
  a.databasename = '<dbname>'
GROUP BY 1,2,3,4,5
ORDER BY current_GB desc;

3. When you need to find specific SQL ...
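The current_GB expression above is just bytes divided by 1024**3 (binary gigabytes). A quick sketch of the same arithmetic in Python, with made-up per-AMP byte counts standing in for dbc.allspace.currentperm:

```python
# currentperm in dbc.allspace is stored in bytes, one row per AMP; the query
# sums them and divides by 1024**3 to report binary gigabytes.
# The byte counts below are made up for illustration.
currentperm_bytes = [805306368, 805306368, 536870912, 536870912]

total_gb = sum(currentperm_bytes) / 1024**3
print(f"{total_gb:.2f}")  # two decimals, like Teradata's FORMAT 'zz9.99'
```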

SQL to pivot columns to rows

-- Create a sample table
create multiset table mydbname.stscorecard (
  id integer,
  sname varchar(20),
  Math_score integer,
  Science_score integer,
  Social_score integer
) primary index (id);

insert into mydbname.stscorecard (id, sname, Math_score, Science_score, Social_score) values (1, 'Student1', 91, 92, 93);
insert into mydbname.stscorecard (id, sname, Math_score, Science_score, Social_score) values (2, 'Student2', 94, 95, 96);

-- The following query demonstrates how to pivot a single row into multiple rows.
-- The assumption here is that the number of columns to pivot to rows is constant;
-- in this example, as you see, we have 3 subjects only.
select * from (
  select id, sname, 'Maths' (varchar(20)) as subject, Math_score as Score
  from mydbname.stscorecard
  union all
  select id, sname, 'Science' as subject, Science_score as Sc...
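The UNION ALL approach above is essentially an "unpivot" (often called melt): each wide row per student becomes one narrow row per subject. A minimal Python sketch of the same reshaping, using the sample rows from the inserts:

```python
# Unpivot (melt) sketch: one wide row per student becomes one narrow
# (id, sname, subject, score) row per subject, mirroring the UNION ALL query.
wide_rows = [
    (1, 'Student1', 91, 92, 93),   # (id, sname, Math, Science, Social)
    (2, 'Student2', 94, 95, 96),
]
subjects = ['Maths', 'Science', 'Social']

narrow_rows = [
    (id_, sname, subject, score)
    for (id_, sname, *scores) in wide_rows
    for subject, score in zip(subjects, scores)
]
print(narrow_rows[0])   # (1, 'Student1', 'Maths', 91)
```

With 2 students and 3 subjects this yields 6 narrow rows, exactly what the three-branch UNION ALL produces.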

SQL to pivot rows to columns

-- Create a sample table
create multiset table mydbname.stscorecard (
  id integer,
  sname varchar(20),
  subject varchar(20),
  score integer
) primary index (id);

-- Insert sample data for student1
insert into mydbname.stscorecard (id, sname, subject, score) values (1, 'student1', 'Maths', 90);
insert into mydbname.stscorecard (id, sname, subject, score) values (1, 'student1', 'Science', 91);
insert into mydbname.stscorecard (id, sname, subject, score) values (1, 'student1', 'Social', 99);

-- Insert sample data for student2
insert into mydbname.stscorecard (id, sname, subject, score) values (2, 'student2', 'Maths', 92);
insert into mydbname.stscorecard (id, sname, subject, score) values (2, 'student2', 'Science', 93);
insert into mydbname.stscorecard (id, sname, subject, score) values (2, 'student2...
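The pivot query itself is cut off in this excerpt; the classic SQL pattern for it is MAX(CASE WHEN subject = '...' THEN score END) grouped by student. A Python sketch of that grouping logic, using the sample rows (student2's Social score is made up, since that insert is truncated above):

```python
# Pivot sketch: narrow (id, sname, subject, score) rows become one wide row
# per student, mirroring the MAX(CASE WHEN ...) GROUP BY pattern.
narrow_rows = [
    (1, 'student1', 'Maths', 90), (1, 'student1', 'Science', 91), (1, 'student1', 'Social', 99),
    (2, 'student2', 'Maths', 92), (2, 'student2', 'Science', 93), (2, 'student2', 'Social', 94),
]

# Group scores by student, keyed by subject.
wide = {}
for id_, sname, subject, score in narrow_rows:
    wide.setdefault((id_, sname), {})[subject] = score

for (id_, sname), scores in sorted(wide.items()):
    print(id_, sname, scores['Maths'], scores['Science'], scores['Social'])
```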

Hive Table usage in Hadoop

In the data engineering world, we often have to make schema changes to an existing table, or rebuild an existing table's data, for multiple reasons: bugs that fell through the cracks, new business requirements that mean the data needs a rebuild to rectify, and so on. The very first thing we need to understand in such scenarios is the impact: who has been using that data in the past 0-6 months, or even longer, depending on the criticality of the data set? To answer such questions, the following code snippet is very useful for identifying the impacted users:

Step 1:
hadoop jar /apache/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.2.4.2.66-4.jar \
  -Dstream.non.zero.exit.is.failure=false \
  -Dmapred.job.queue.name=<QUEUE Name> \
  -Dmapred.job.name="grepper" \
  -Dmapred.reduce.tasks=1 \
  -input /logs/<HADOOP NAMENODE>/auditlog/YYYY-* \
  -output <a HDFS location where your account has write access>/ \
  -mapper ...
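The -mapper argument is cut off in this excerpt, but the job named "grepper" is essentially a distributed grep over the HDFS audit logs. A hedged sketch of what such a streaming mapper's core logic might look like; the table name and the audit-log line format below are assumptions for illustration, not the post's actual mapper:

```python
# Hypothetical Hadoop Streaming mapper logic: keep only audit-log lines that
# mention the table we care about, like `grep` would. In a real streaming job
# this function would be fed sys.stdin and its output printed line by line.
def grep_lines(lines, needle):
    """Yield the lines that contain the needle."""
    for line in lines:
        if needle in line:
            yield line.rstrip('\n')

# Made-up audit-log lines, loosely shaped like HDFS namenode audit entries.
sample_log = [
    '2019-05-01 10:02:11 INFO audit: ugi=alice cmd=open src=/hive/warehouse/mydbname.db/stscorecard/part-0001\n',
    '2019-05-01 10:03:42 INFO audit: ugi=bob cmd=open src=/hive/warehouse/otherdb.db/clicks/part-0007\n',
]

hits = list(grep_lines(sample_log, 'mydbname.db/stscorecard'))
print(len(hits))  # only alice's access matches
```

The ugi= field in each surviving line is what identifies the impacted users.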

Spark SQL - Performance Tuning tips

In this article I am sharing my understanding of, and experiences with, Spark SQL, since we just completed migrating 300+ tables in my domain from Teradata-based ETL processes to Spark SQL processes using HDFS as the back-end data storage cluster. How can we maintain the distribution of data the way we have it in Teradata, based on a Primary Index? Spark does support this feature via the "distribute by" clause. As the name reflects, it distributes the data across partitions based on the columns (or valid expressions) you pass it, spreading rows evenly. Of course, if the chosen columns or expressions are themselves skewed, we will again end up with an uneven distribution of data, just as you may have experienced in Teradata by choosing an inappropriate primary index column. Well, why do we have to distribute the data evenly across all partitions? The answer is simple. Let's say your team has 4 engineers and your scrum lead assigns the workload unevenly; will that be of any good? ...
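Conceptually, "distribute by" assigns each row to a partition by hashing the chosen key, so a high-cardinality key spreads rows evenly while a skewed key piles them onto a few partitions. A minimal sketch of that idea (the partition count and sample keys are made up, and Python's hash stands in for Spark's Murmur3 hash partitioning):

```python
from collections import Counter

NUM_PARTITIONS = 4

def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for Spark's hash partitioning (Spark actually uses Murmur3).
    return hash(key) % num_partitions

good_keys = range(1000)                    # e.g. a unique, high-cardinality id column
skewed_keys = ['US'] * 970 + ['CA'] * 30   # e.g. a country column dominated by one value

good_counts = Counter(partition_of(k) for k in good_keys)
skew_counts = Counter(partition_of(k) for k in skewed_keys)
print(sorted(good_counts.values()))  # 250 rows on each of the 4 partitions
print(sorted(skew_counts.values()))  # nearly all 1000 rows land together
```

This is exactly the scrum-lead analogy in code: with the skewed key, one "engineer" (partition) ends up doing almost all the work while the others sit idle.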