I have a table that has a column that holds an html string that may or may not be well formed (tried to use xmltable route and it didn't work) which is why I am trying to use the following sql syntax. I am trying to create a query that uses REGEXP_SUBSTR, LEVEL and CONNECT BY LEVEL based on examples I found online but I am not able to get the results I would expect:
I created an example query below:
with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
)
SELECT ID,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
FROM qry
CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')
By running this query, I would expect 7 rows to come back:
| ID | Contents | Data_Jump |
|---|---|---|
| 1 | Client Servic | 111 |
| 1 | Customer Servic | 112 |
| 1 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Customer Servic | 112 |
Instead its coming back with 22 rows.
| ID | Contents | Data_Jump |
|---|---|---|
| 1 | Client Servic | 111 |
| 1 | Customer Servic | 112 |
| 1 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Customer Servic | 112 |
| 1 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Client Servic | 111 |
| 1 | Customer Servic | 112 |
| 1 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Customer Servic | 112 |
| 1 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
| 2 | Tecnical Servic | 113 |
| 2 | Other Servic | 114 |
I don't know where the problem is that I am getting duplicates. On the actual production table there are a lot of rows so the query just spins and never returns. I suspect its because a lot of duplicates are being generated. Can anyone tell me how to fix the above query so that I can try to transport it onto my actual query?
Thanks,
You need to include the ID in the connect-by clause; but for this to wotk you also need to include a non-deterministic function call; e.g.:
So
db<>fiddle
Or you could use recursive subquery factoring instead of a hierarchical query, which is maybe easier to understand:
db<>fiddle
Or even, preserving the last character of the 'contents' by removing the stray
.from the regex as @Gary_W pointed out:db<>fiddle