Parsing a long text and fetching desired pattern strings using Spark SQL

One of my co-workers from the analyst group approached me for help with a real-world problem she was facing: extracting all matching words/substrings from a long text column using Spark SQL.

The long text is basically an email body that a seller sends out to his/her customers, containing many URLs such as their store webpage, Facebook page, etc.

So the first thing she tried, naturally, was the built-in regexp_extract function.


SET spark.sql.parser.escapedStringLiterals=true;

select
  regexp_extract(
    lower('Thanks for your purchase Leave your feedback  http //feedback.abc.com . Like us on facebook https //www.facebook.com/abcPets". Visit our site here "www.abcpets.com" and save 7 of your next purchase use code save7 at checkout. Call us 123 456 7890.'),
    '((http //|https //|http://|https://|(www\.))\w+\.\w+)',
    0
  ) as URL_PATTERN_1;

But the result was only the first matching string, as shown below:

URL_PATTERN_1
-------------------
http //feedback.abc

So, to mitigate this, the alternatives were either to write a UDF or to combine several other built-in functions to get the desired result. Here is my solution, with no UDF needed.
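To see why only the first match comes back, the same behavior can be reproduced outside Spark. This is an illustrative sketch in plain Python (not Spark code): re.search behaves like regexp_extract and stops at the first match, while re.finditer returns every match. Note the sample email genuinely contains "http //" with a space instead of a colon, exactly as in the raw data.

```python
import re

# Sample email body, with the missing colons after http/https
# exactly as in the raw data above.
text = ("Thanks for your purchase Leave your feedback  http //feedback.abc.com . "
        "Like us on facebook https //www.facebook.com/abcPets. "
        "Visit our site here www.abcpets.com")

pattern = r'(http //|https //|http://|https://|www\.)\w+\.\w+'

# regexp_extract-like behavior: only the first match is returned
first = re.search(pattern, text.lower()).group(0)
print(first)  # http //feedback.abc

# What we actually want: every match
all_matches = [m.group(0) for m in re.finditer(pattern, text.lower())]
print(all_matches)
```

The single-match behavior is by design, which is why the rest of this post takes a split-and-filter route instead.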

To illustrate the solution, we first need some sample data. So here it is:


/* Create sample data for testing */
create or replace temporary view v1
as
select 1 as email_num, 'Sender1' as sender, 'Receiver1' as receiver,
  'Thanks for your purchase Leave your feedback  http //feedback.abc.com . Like us on facebook https //www.facebook.com/abcPets. Visit our site here www.abcpets.com and save 7 of your next purchase use code save7 at checkout. Call us 123 456 7890.' as email_text
union
select 2 as email_num, 'Sender2' as sender, 'Receiver2' as receiver,
  'Thanks for your  http //feedback.toys.com purchase Leave your feedback  http //feedback.toys.com . Like us on facebook https //www.facebook.com/toys4Pets. Visit our site here www.123toys222222.com and save 7 of your next purchase use code save7 at checkout. Call us 987 654 3210.' as email_text
;



/* cache the data into memory */
cache table v1;

/* verify the sample data */
select * from v1;

And the next step is the actual SQL that parses the above sample data, extracts all of the desired URL patterns, and returns them in a single row as a comma-separated list.


-- Parse the string column of interest:
-- 1. Replace the space right after keywords of interest like http, https with tilde '~', a special character we can recognize later
-- 2. Split the text column with space as the delimiter
-- 3. Select only the rows containing keywords of interest like http, www, etc.
-- 4. To avoid a multi-row result, group by the key columns and concatenate the parsed text column


select
  src.email_num,
  src.sender,
  src.receiver,
  replace( concat_ws(', ', collect_list(email_text_arr_str)), '~', ' ' ) as email_text_arr_str
from
  (
    select
      v.*,
      explode(
        split(
          replace( replace(v.email_text, 'http ', 'http~'), 'https ', 'https~' ), ' '
        )
      ) as email_text_arr_str
    from
      v1 as v
  ) as src
where
  email_text_arr_str like '%http%' or email_text_arr_str like '%www.%'
group by 1, 2, 3
order by src.email_num
;
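For readers who want to check the logic outside Spark, the same four steps (tag, split, filter, concatenate) can be sketched in plain Python. This is an illustrative sketch only; the function and variable names are my own, not part of the SQL above:

```python
def extract_urls(email_text: str) -> str:
    # 1. Tag the space after http/https with '~' so each URL
    #    survives the split as a single token
    tagged = email_text.replace('http ', 'http~').replace('https ', 'https~')
    # 2. Split on spaces, like Spark's split(email_text, ' ')
    tokens = tagged.split(' ')
    # 3. Keep only tokens containing the keywords of interest
    urls = [t for t in tokens if 'http' in t or 'www.' in t]
    # 4. Concatenate into one comma-separated string, restoring the space
    return ', '.join(urls).replace('~', ' ')

email = ("Thanks for your purchase Leave your feedback  http //feedback.abc.com . "
         "Like us on facebook https //www.facebook.com/abcPets. "
         "Visit our site here www.abcpets.com")
print(extract_urls(email))
```

Note that the replacement of 'http ' does not clobber 'https ' tokens, because in "https " the character after "http" is "s", not a space; the two replace calls therefore tag the two keyword variants independently.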

And here is the actual result, with all of the URLs in a single row as a comma-separated list.


As the data is quite unstructured, there are a couple of stray quotes ("") and extra suffixes in the resulting list of URLs. But these can be removed with further transformations.
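One way to do that cleanup, sketched here per token in Python (a hypothetical helper, not part of the SQL above; in Spark SQL the equivalent would be done with functions like regexp_replace or trim):

```python
def clean_url(token: str) -> str:
    # Strip surrounding straight and curly quotes, then trailing
    # punctuation left over from the unstructured text
    return token.strip('"“”').rstrip('.,')

print(clean_url('"www.abcpets.com"'))
print(clean_url('https //www.facebook.com/abcPets.'))
```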
