Parsing a long text and fetching desired pattern strings using Spark SQL
A co-worker from the analyst group asked me for help with a real-world problem she was facing: extracting all matching words or substrings from a long text column using Spark SQL. The long text is an email body that a seller sends to his or her customers, and it contains several URLs such as the store web page, the Facebook page, and so on.

The first thing she tried, naturally, was the built-in regexp_extract function:

    SET spark.sql.parser.escapedStringLiterals=true;

    SELECT REGEXP_EXTRACT(
      lower('Thanks for your purchase Leave your feedback "http //feedback.abc.com". Like us on facebook "https //www.facebook.com/abcPets". Visit our site here "www.abcpets.com" and and save 7 of your next purchase use code save7 at checkout. Call us 123 456 7890.'),
      '((http //|https //|http://|https://|(www.))\w+\.\w+)',
      0
    ) AS URL_PATTERN_1;

But the result contained only the first matching string, as shown below:

    URL_PATTERN_1
    -------------
    http...
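The first-match-only behaviour is not specific to Spark; it is how a single "extract" call works in most regex engines. As a rough illustration outside Spark, the sketch below uses Python's standard `re` module with the same pattern and sample text. The helper names `first_match` and `all_matches` are hypothetical, and Python's regex engine is only an approximation of the Java engine that Spark SQL uses, so treat this as an analogy rather than a reproduction of the Spark result.

```python
import re

# Sample email body from the query above (kept verbatim, including the
# colon-less "http //" forms that the pattern deliberately allows for).
EMAIL_BODY = (
    'Thanks for your purchase Leave your feedback "http //feedback.abc.com". '
    'Like us on facebook "https //www.facebook.com/abcPets". '
    'Visit our site here "www.abcpets.com" and and save 7 of your next purchase '
    'use code save7 at checkout. Call us 123 456 7890.'
)

# Same pattern as in the SQL query.
URL_PATTERN = re.compile(r"((http //|https //|http://|https://|(www.))\w+\.\w+)")


def first_match(text: str) -> str:
    """Mimic a single-extract call: return only the first match, or '' if none."""
    match = URL_PATTERN.search(text.lower())
    return match.group(0) if match else ""


def all_matches(text: str) -> list[str]:
    """Return every non-overlapping match, scanning left to right."""
    return [m.group(0) for m in URL_PATTERN.finditer(text.lower())]


if __name__ == "__main__":
    print("first match:", first_match(EMAIL_BODY))
    print("all matches:", all_matches(EMAIL_BODY))
```

Scanning with `finditer` is what the analyst actually needed: one row of text in, a list of every URL-like substring out. A single search stops as soon as the leftmost match is found.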