Parsing a long text and fetching desired pattern strings using SPARK SQL
One of my co-workers from analyst group has approached me for helping with a real-time problem that she was facing in extracting all matching words/substrings within a long text column using spark SQL.
The long text is basically an email body that a seller sends out to his/her customers with many urls like their store webpage, facebook page etc.
So first thing obviously she tried was using regex_extract built-in function.
SET spark.sql.parser.escapedStringLiterals=true; select REGEXP_EXTRACT(lower(‘Thanks for your purchase Leave your feedback “ http //feedback.abc.com” . Like us on facebook “https //www.facebook.com/abcPets". Visit our site here “www.abcpets.com" and and save 7 of your next purchase use code save7 at checkout. Call us 123 456 7890.’),’((http //|https //|http://|https://|(www.))\w+\.\w+)',0) AS URL_PATTERN_1 ;
But the result was only first matching string as shown below:
URL_PATTERN_1
— — — — — — — — — -
http //feedback.abc
So, to mitigate this the alternate solution was to either write UDF or leverage some other builtin functions in tandem to get the desired result. So, here is my solution without the need to write UDF.
To illustrate this solution, first we need some sample data. So here it is:
/* Creating sample data for testing */ create or replace temporary view v1 as select 1 as email_num, ‘Sender1’ as sender, ‘Receiver1’ as receiver, ‘Thanks for your purchase Leave your feedback “ http //feedback.abc.com” . Like us on facebook https //www.facebook.com/abcPets.Visit our site here www.abcpets.com and and save 7 of your next purchase use code save7 at checkout. Call us 123 456 7890.’ as email_text union select 2 as email_num, ‘Sender2’ as sender, ‘Receiver2’ as receiver, ‘Thanks for your “ http //feedback.toys.com” purchase Leave your feedback “ http //feedback.toys.com” . Like us on facebook https //www.facebook.com/toys4Pets.Visit our site here www.123toys222222.com and and save 7 of your next purchase use code save7 at checkout. Call us 987 654 3210.’ as email_text ; /* cache the data into memory */ cache table v1; /* verify the sample data */ select * from v1;

And the next step is actual SQL that parses the above sample data and extract all of the desired patterned urls and return that in single row as comma separated list
-- Parse the string column that is of our interest -- 1. Replace spaces that are right after interested keywords like http, www with tilde '~' a special character that we can distinguish later -- 2. Split the text column with space as delimiter -- 3. Select only the rows that are having interested keywords like http, www etc -- 4. To avoid multiple rows result, group those based on the key elements and concatenate the test column that was parsed select src.email_num, src.sender, src.receiver, replace( concat_ws(', ', collect_list(email_text_arr_str)), '~', ' ' ) as email_text_arr_str from ( select v.*, explode( split( replace( replace(v.email_text, 'http ', 'http~'), 'https ', 'https~' ), ' ' ) ) as email_text_arr_str from v1 as v order by v.email_num ) as src where email_text_arr_str like '%http%' or email_text_arr_str like '%www.%' group by 1, 2, 3 ;
And actual result is here, with all of the urls in a single row as a comma separated list.

As the data is so unstructured, there are couple of quotes (‘’), and extra suffixes in the resulting list of urls. But these can be ignored by doin further transformations.


Comments
Post a Comment