How to find duplicate entries in a JSON string in Oracle PL/SQL


I have tried different ways to find the row that has a duplicate key in its JSON, but without success.

This is the row in the table that I'm trying to retrieve; the duplicate key is "12/07/2023":1000:

{"fl":{"12/07/2023":1000,"12/07/2023":1000,"12/31/2023":1500,"12/13/2023":2000},"TIME1":{"12/07/2023":0,"12/31/2023":5000}}

Test data:


CREATE TABLE test_json1 (id INT, json_date CLOB);

INSERT INTO test_json1 (id, json_date)
SELECT 2, '{"fl":{"12/07/2023":2000,"12/08/2023":3000,"12/31/2023":4000,"12/13/2023":5000},"TIME1":{"12/07/2023":0,"12/31/2023":6000}}' FROM DUAL UNION ALL
SELECT 1, '{"fl":{"12/07/2023":1000,"12/07/2023":1000,"12/31/2023":1500,"12/13/2023":2000},"TIME1":{"12/07/2023":0,"12/31/2023":5000}}' FROM DUAL UNION ALL
SELECT 1, '{"fl":{"12/07/2023":150,"12/31/2023":250,"12/13/2023":350},"TIME1":{"12/07/2023":0,"12/31/2023":100}}' FROM DUAL;
 
1 Answer

Answer by MT0:

This is a difficult problem. The JSON specification (PDF) does allow duplicate keys; it states:

6 Objects

An object structure is represented as a pair of curly bracket tokens surrounding zero or more name/value pairs. A name is a string. A single colon token follows each name, separating the name from the value. A single comma token separates a value from a following name. The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs. These are all semantic considerations that may be defined by JSON processors or in specifications defining specific uses of JSON for data interchange.

Therefore, your JSON may technically be valid, but most JSON processors will not keep duplicate keys and will overwrite the first instance of a duplicate with the second.
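
For example, a minimal PL/SQL sketch (assuming Oracle 12.2 or later, where the JSON_OBJECT_T API used later in this answer is available) shows the parser keeping only one instance of the key:

DECLARE
  jo JSON_OBJECT_T;
BEGIN
  -- parse a fragment with a duplicated key and re-serialise it
  -- (run with SET SERVEROUTPUT ON to see the result)
  jo := JSON_OBJECT_T.PARSE('{"12/07/2023":1000,"12/07/2023":1000,"12/31/2023":1500}');
  DBMS_OUTPUT.PUT_LINE(jo.TO_STRING);
  -- typically prints {"12/07/2023":1000,"12/31/2023":1500} - the duplicate is gone
END;
/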

If you try to extract the keys using JSON_DATAGUIDE:

SELECT t.id,
       dg.structure
FROM   test_json1 t
       CROSS APPLY (
         SELECT JSON_DATAGUIDE(t.json_date) AS structure
         FROM   DUAL
       ) dg;

Then the output is:

ID STRUCTURE
2 [{"o:path":"$","type":"object","o:length":128},{"o:path":"$.fl","type":"object","o:length":128},{"o:path":"$.fl."12/07/2023"","type":"number","o:length":4},{"o:path":"$.fl."12/08/2023"","type":"number","o:length":4},{"o:path":"$.fl."12/13/2023"","type":"number","o:length":4},{"o:path":"$.fl."12/31/2023"","type":"number","o:length":4},{"o:path":"$.TIME1","type":"object","o:length":64},{"o:path":"$.TIME1."12/07/2023"","type":"number","o:length":1},{"o:path":"$.TIME1."12/31/2023"","type":"number","o:length":4}]
1 [{"o:path":"$","type":"object","o:length":128},{"o:path":"$.fl","type":"object","o:length":128},{"o:path":"$.fl."12/07/2023"","type":"number","o:length":4},{"o:path":"$.fl."12/13/2023"","type":"number","o:length":4},{"o:path":"$.fl."12/31/2023"","type":"number","o:length":4},{"o:path":"$.TIME1","type":"object","o:length":64},{"o:path":"$.TIME1."12/07/2023"","type":"number","o:length":1},{"o:path":"$.TIME1."12/31/2023"","type":"number","o:length":4}]
1 [{"o:path":"$","type":"object","o:length":128},{"o:path":"$.fl","type":"object","o:length":64},{"o:path":"$.fl."12/07/2023"","type":"number","o:length":4},{"o:path":"$.fl."12/13/2023"","type":"number","o:length":4},{"o:path":"$.fl."12/31/2023"","type":"number","o:length":4},{"o:path":"$.TIME1","type":"object","o:length":64},{"o:path":"$.TIME1."12/07/2023"","type":"number","o:length":1},{"o:path":"$.TIME1."12/31/2023"","type":"number","o:length":4}]

and {"o:path":"$.fl.\"12\/07\/2023\"","type":"number","o:length":4} only appears once (the duplicate is discarded).

You can get the values using:

SELECT t.id,
       j.pos,
       j.value
FROM   test_json1 t
       CROSS APPLY JSON_TABLE(
         t.json_date,
         '$.fl.*'
         COLUMNS (
           pos   FOR ORDINALITY,
           value PATH '$'
         )
       ) j;

Which outputs:

ID POS VALUE
2 1 2000
2 2 3000
2 3 4000
2 4 5000
1 1 1000
1 2 1000
1 3 1500
1 4 2000
1 1 150
1 2 250
1 3 350

The duplicate value is extracted, but there is no SQL/JSON expression to extract the key corresponding to a particular position within the object.

You can reverse-engineer it, to some degree, using PL/SQL helper functions:

-- Returns the name of the key at position pos in the (de-duplicated) object.
CREATE FUNCTION get_key(
  pos  IN PLS_INTEGER,
  json IN CLOB
) RETURN VARCHAR2
AS
  doc_keys JSON_KEY_LIST;
BEGIN
  doc_keys := JSON_OBJECT_T.PARSE( json ).GET_KEYS;
  RETURN doc_keys( pos );
END get_key;
/
-- Returns the number of keys the PL/SQL parser sees (duplicates are discarded).
CREATE FUNCTION num_keys(
  json IN CLOB
) RETURN PLS_INTEGER
AS
  doc_keys JSON_KEY_LIST;
BEGIN
  doc_keys := JSON_OBJECT_T.PARSE( json ).GET_KEYS;
  RETURN doc_keys.COUNT;
END num_keys;
/
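
A quick sanity check of the helpers (the two-key object is just a hypothetical example):

SELECT get_key(1, '{"a":1,"b":2}') AS first_key,
       num_keys('{"a":1,"b":2}')   AS key_count
FROM   dual;

-- FIRST_KEY  KEY_COUNT
-- a          2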

Then:

SELECT t.id,
       j.pos,
       CASE
       WHEN pos <= num_keys(t.json_date.fl)
       THEN get_key(j.pos, t.json_date.fl)
       END AS key,
       j.value
FROM   test_json1 t
       CROSS APPLY JSON_TABLE(
         t.json_date,
         '$.fl.*'
         COLUMNS (
           pos   FOR ORDINALITY,
           value PATH '$'
         )
       ) j;

Outputs:

ID POS KEY VALUE
2 1 12/07/2023 2000
2 2 12/08/2023 3000
2 3 12/31/2023 4000
2 4 12/13/2023 5000
1 1 12/07/2023 1000
1 2 12/31/2023 1000
1 3 12/13/2023 1500
1 4 null 2000
1 1 12/07/2023 150
1 2 12/31/2023 250
1 3 12/13/2023 350

But then you can see that the PL/SQL parser has discarded the duplicate key, just like JSON_DATAGUIDE did, and the keys for the second row of data do not correspond to the values.
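
That mismatch can itself be turned into a check, in the same spirit as the data guide comparison earlier. As a hedged sketch (JSON_QUERY is used here to pull out the fl fragment for the helper; the aliases are illustrative), a row is suspect when JSON_TABLE finds more values under $.fl than num_keys reports:

SELECT t.id
FROM   test_json1 t
       CROSS APPLY (
         -- every value under $.fl, duplicates included
         SELECT COUNT(*) AS value_cnt
         FROM   JSON_TABLE(t.json_date, '$.fl.*' COLUMNS (val NUMBER PATH '$'))
       ) v
WHERE  v.value_cnt > num_keys(JSON_QUERY(t.json_date, '$.fl' RETURNING CLOB));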

So, while this problem should really be solved using a JSON parser, in practice most JSON parser implementations are going to discard duplicate keys.

You could try solving it using regular expressions to find a key followed by a number and then a repeat of that key and number within the same object:

SELECT id,
       REGEXP_SUBSTR(
         json_date,
         '"(([^"]|\\")*)":\s*(\d+)[^}]*,\s*"\1":\s*\3',
         1,
         1,
         NULL,
         1
       ) AS key,
       REGEXP_SUBSTR(
         json_date,
         '"(([^"]|\\")*)":\s*(\d+)[^}]*,\s*"\1":\s*\3',
         1,
         1,
         NULL,
         3
       ) AS value
FROM   test_json1
WHERE  REGEXP_LIKE(json_date, '"(([^"]|\\")*)":\s*(\d+)[^}]*,\s*"\1":\s*\3');

Which outputs:

ID KEY VALUE
1 12/07/2023 1000

Note: this will only find the first duplicate in a row, and there may be edge cases where the regular expression does not work.
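
If you need more than the first match per row, a hedged extension of the same idea (the occ row generator and the aliases are illustrative, and the same edge-case caveats apply) is to enumerate the non-overlapping matches with REGEXP_COUNT:

SELECT t.id,
       REGEXP_SUBSTR(t.json_date, '"(([^"]|\\")*)":\s*(\d+)[^}]*,\s*"\1":\s*\3',
                     1, o.occ, NULL, 1) AS key,
       REGEXP_SUBSTR(t.json_date, '"(([^"]|\\")*)":\s*(\d+)[^}]*,\s*"\1":\s*\3',
                     1, o.occ, NULL, 3) AS value
FROM   test_json1 t
       CROSS APPLY (
         -- one row per match of the duplicate-key pattern in this document
         SELECT LEVEL AS occ
         FROM   DUAL
         CONNECT BY LEVEL <= REGEXP_COUNT(t.json_date,
                               '"(([^"]|\\")*)":\s*(\d+)[^}]*,\s*"\1":\s*\3')
       ) o;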
