Posts

Parsing a long text and fetching desired pattern strings using SPARK SQL

One of my co-workers from the analyst group approached me for help with a problem she was facing: extracting all matching words/substrings within a long text column using Spark SQL. The long text is basically an email body that a seller sends out to his/her customers, with many URLs such as their store webpage, Facebook page, etc. So the first thing she tried, naturally, was the regexp_extract built-in function:

SET spark.sql.parser.escapedStringLiterals=true;

select REGEXP_EXTRACT(
  lower('Thanks for your purchase. Leave your feedback "http //feedback.abc.com". Like us on facebook "https //www.facebook.com/abcPets". Visit our site here "www.abcpets.com" and save 7 of your next purchase, use code save7 at checkout. Call us 123 456 7890.'),
  '((http //|https //|http://|https://|(www.))\w+\.\w+)', 0) AS URL_PATTERN_1;

But the result was only the first matching string, as shown below:

URL_PATTERN_1
-------------
http...
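The behavior the post is after (every match, not just the first) is what Spark's regexp_extract_all provides in Spark 3.1+. A minimal sketch of that behavior using Python's re.findall as a stand-in, with the email body and a slightly tidied version of the post's pattern (the unescaped dot after www is escaped here):

```python
import re

# Sample email body from the post, lowercased as the query does.
body = ('thanks for your purchase. leave your feedback "http //feedback.abc.com". '
        'like us on facebook "https //www.facebook.com/abcpets". '
        'visit our site here "www.abcpets.com" and save 7 of your next purchase, '
        'use code save7 at checkout. call us 123 456 7890.')

# Non-capturing version of the post's pattern; findall then returns whole matches.
pattern = r'(?:http //|https //|http://|https://|www\.)\w+\.\w+'

# findall returns every non-overlapping match, not just the first one.
urls = re.findall(pattern, body)
print(urls)
```

In Spark SQL itself the equivalent call would be regexp_extract_all(col, pattern, 0), which returns an array of all matches.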

Teradata - Useful SQLs

1. How to create a copy of a table from an existing table, including its data, in one line?

create multiset table <NewTableDB>.<NewTableName> as <ExistingTableDB>.<ExistingTableName> with data;

2. Sometimes we may need to review table-level stats (who created the table and when, when it was last accessed, etc.) for various reasons. In such scenarios you can use the following SQL:

-- Table space in GB along with last access time in a specific database
SELECT
  a.databasename,
  a.tablename,
  t.creatorname,
  t.CreateTimestamp,
  t.LastAccessTimestamp,
  SUM(a.currentperm)/1024**3 (FORMAT 'zz9.99') AS current_GB
FROM dbc.allspace a
JOIN dbc.tablesv t
  ON a.databasename = t.databasename
  AND a.tablename = t.tablename
WHERE
  -- tablename = '<tblname>' AND
  a.databasename = '<dbname>'
GROUP BY 1,2,3,4,5
ORDER BY current_GB desc;

3. When you need to find specific SQL ...
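The current_GB expression above is just bytes divided by 1024**3 (binary gigabytes). A quick sketch of the same arithmetic in Python, with made-up per-AMP byte counts standing in for dbc.allspace.currentperm:

```python
# currentperm in dbc.allspace is stored in bytes, one row per AMP; the query
# sums them and divides by 1024**3 to report binary gigabytes.
# The byte counts below are made up for illustration.
currentperm_bytes = [805306368, 805306368, 536870912, 536870912]

total_gb = sum(currentperm_bytes) / 1024**3
print(f"{total_gb:.2f}")  # two decimals, like Teradata's FORMAT 'zz9.99'
```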

SQL to pivot columns to rows

-- Create a sample table
create multiset table mydbname.stscorecard (
  id integer,
  sname varchar(20),
  Math_score integer,
  Science_score integer,
  Social_score integer
) primary index (id);

insert into mydbname.stscorecard (id, sname, Math_score, Science_score, Social_score) values (1, 'Student1', 91, 92, 93);
insert into mydbname.stscorecard (id, sname, Math_score, Science_score, Social_score) values (2, 'Student2', 94, 95, 96);

-- The following query demonstrates how to pivot a single row into multiple rows.
-- The assumption here is that the number of columns to pivot to rows is constant;
-- in this example, as you see, we have 3 subjects only.
select * from (
  select id, sname, 'Maths' (varchar(20)) as subject, Math_score as Score
  from mydbname.stscorecard
  union all
  select id, sname, 'Science' as subject, Science_score as Sc...
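The UNION ALL approach above is essentially an "unpivot" (often called melt): each wide row per student becomes one narrow row per subject. A minimal Python sketch of the same reshaping, using the sample rows from the inserts:

```python
# Unpivot (melt) sketch: one wide row per student becomes one narrow
# (id, sname, subject, score) row per subject, mirroring the UNION ALL query.
wide_rows = [
    (1, 'Student1', 91, 92, 93),   # (id, sname, Math, Science, Social)
    (2, 'Student2', 94, 95, 96),
]
subjects = ['Maths', 'Science', 'Social']

narrow_rows = [
    (id_, sname, subject, score)
    for (id_, sname, *scores) in wide_rows
    for subject, score in zip(subjects, scores)
]
print(narrow_rows[0])   # (1, 'Student1', 'Maths', 91)
```

With 2 students and 3 subjects this yields 6 narrow rows, exactly what the three-branch UNION ALL produces.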

SQL to pivot rows to columns

-- Create a sample table
create multiset table mydbname.stscorecard (
  id integer,
  sname varchar(20),
  subject varchar(20),
  score integer
) primary index (id);

-- Insert sample data for student1
insert into mydbname.stscorecard (id, sname, subject, score) values (1, 'student1', 'Maths', 90);
insert into mydbname.stscorecard (id, sname, subject, score) values (1, 'student1', 'Science', 91);
insert into mydbname.stscorecard (id, sname, subject, score) values (1, 'student1', 'Social', 99);

-- Insert sample data for student2
insert into mydbname.stscorecard (id, sname, subject, score) values (2, 'student2', 'Maths', 92);
insert into mydbname.stscorecard (id, sname, subject, score) values (2, 'student2', 'Science', 93);
insert into mydbname.stscorecard (id, sname, subject, score) values (2, 'student2...
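The pivot query itself is cut off in this excerpt; the classic SQL pattern for it is MAX(CASE WHEN subject = '...' THEN score END) grouped by student. A Python sketch of that grouping logic, using the sample rows (student2's Social score is made up, since that insert is truncated above):

```python
# Pivot sketch: narrow (id, sname, subject, score) rows become one wide row
# per student, mirroring the MAX(CASE WHEN ...) GROUP BY pattern.
narrow_rows = [
    (1, 'student1', 'Maths', 90), (1, 'student1', 'Science', 91), (1, 'student1', 'Social', 99),
    (2, 'student2', 'Maths', 92), (2, 'student2', 'Science', 93), (2, 'student2', 'Social', 94),
]

# Group scores by student, keyed by subject.
wide = {}
for id_, sname, subject, score in narrow_rows:
    wide.setdefault((id_, sname), {})[subject] = score

for (id_, sname), scores in sorted(wide.items()):
    print(id_, sname, scores['Maths'], scores['Science'], scores['Social'])
```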

Hive Table usage in Hadoop

In the data engineering world, we often have to make schema changes to an existing table, or rebuild an existing table's data, for multiple reasons: bugs that fell through the cracks, new business requirements that mean the data needs a rebuild to rectify, and so on. The very first thing we need to understand in such scenarios is the impact: who has been using that data in the past 0-6 months, or even longer, depending on the criticality of the data set? To answer such questions, the following code snippet is very useful for identifying the impacted users:

Step 1:
hadoop jar /apache/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.2.4.2.66-4.jar \
  -Dstream.non.zero.exit.is.failure=false \
  -Dmapred.job.queue.name=<QUEUE Name> \
  -Dmapred.job.name="grepper" \
  -Dmapred.reduce.tasks=1 \
  -input /logs/<HADOOP NAMENODE>/auditlog/YYYY-* \
  -output <a HDFS location where your account has write access>/ \
  -mapper ...
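The -mapper argument is cut off in this excerpt, but the job named "grepper" is essentially a distributed grep over the HDFS audit logs. A hedged sketch of what such a streaming mapper's core logic might look like; the table name and the audit-log line format below are assumptions for illustration, not the post's actual mapper:

```python
# Hypothetical Hadoop Streaming mapper logic: keep only audit-log lines that
# mention the table we care about, like `grep` would. In a real streaming job
# this function would be fed sys.stdin and its output printed line by line.
def grep_lines(lines, needle):
    """Yield the lines that contain the needle."""
    for line in lines:
        if needle in line:
            yield line.rstrip('\n')

# Made-up audit-log lines, loosely shaped like HDFS namenode audit entries.
sample_log = [
    '2019-05-01 10:02:11 INFO audit: ugi=alice cmd=open src=/hive/warehouse/mydbname.db/stscorecard/part-0001\n',
    '2019-05-01 10:03:42 INFO audit: ugi=bob cmd=open src=/hive/warehouse/otherdb.db/clicks/part-0007\n',
]

hits = list(grep_lines(sample_log, 'mydbname.db/stscorecard'))
print(len(hits))  # only alice's access matches
```

The ugi= field in each surviving line is what identifies the impacted users.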

Spark SQL - Performance Tuning tips

In this article I am sharing my understanding of, and experiences with, Spark SQL, since we just completed migrating 300+ tables in my domain from Teradata-based ETL processes to Spark SQL processes using HDFS as the back-end data storage cluster. How can we maintain the distribution of data the way we have it in Teradata, based on a Primary Index? Spark does support this feature via the "distribute by" clause. As the name reflects, it distributes the data across partitions based on the columns (or valid expressions) you pass it, spreading rows evenly. Of course, if the chosen columns or expressions are themselves skewed, we will again end up with an uneven distribution of data, just as you may have experienced in Teradata by choosing an inappropriate primary index column. Well, why do we have to distribute the data evenly across all partitions? The answer is simple. Let's say your team has 4 engineers and your scrum lead assigns the workload unevenly; will that be of any good? ...
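Conceptually, "distribute by" assigns each row to a partition by hashing the chosen key, so a high-cardinality key spreads rows evenly while a skewed key piles them onto a few partitions. A minimal sketch of that idea (the partition count and sample keys are made up, and Python's hash stands in for Spark's Murmur3 hash partitioning):

```python
from collections import Counter

NUM_PARTITIONS = 4

def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for Spark's hash partitioning (Spark actually uses Murmur3).
    return hash(key) % num_partitions

good_keys = range(1000)                    # e.g. a unique, high-cardinality id column
skewed_keys = ['US'] * 970 + ['CA'] * 30   # e.g. a country column dominated by one value

good_counts = Counter(partition_of(k) for k in good_keys)
skew_counts = Counter(partition_of(k) for k in skewed_keys)
print(sorted(good_counts.values()))  # 250 rows on each of the 4 partitions
print(sorted(skew_counts.values()))  # nearly all 1000 rows land together
```

This is exactly the scrum-lead analogy in code: with the skewed key, one "engineer" (partition) ends up doing almost all the work while the others sit idle.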