Select substring of multiple rows with similar values

89 Views Asked by At

I am trying to select same substring of multiple rows without specifying the substring first.
Basically find the common part in a string. The common parts in a string range from 2 to 5 words.

Here is what I am trying to achieve:

The goal is to find the longest/last common denominator of multiple rows, creating a group / a set. The substring at the end of a string that is unique to a single row in a set should be trimmed.

Rows with that have same substring from left are considered part of a set.

"Item A 1" and "Item A 2" are part of "Item A" set.

"Item B 1" and "Item B 2" are part of "Item B" set.

The sets can include multiple parts from string.

"Item A A 1" and "Item A A 2" are part of "Item A A" set.

"Item A B 1" and "Item A B 2" are part of "Item A B" set.

Also value of the row might be just the name of a set.

"Item A" and "Item A 1" are part of "Item A" set.

"Item A B" and "Item A B 1" are part of "Item A B" set.

id item set
1 DEP Item 1 DEP Item
2 DEP Item 10 DEP Item
3 DEP Item 11 DEP Item
4 DEP Item 24 DEP Item
5 DEP G1 Item B 1 DEP G1 Item B
6 DEP G1 Item B 10 DEP G1 Item B
7 DEP G1 Item B 11 DEP G1 Item B
8 DEP G1 Item B 24 DEP G1 Item B
9 DEP A Item B 1 1 A DEP A Item B 1
10 DEP A Item B 1 10 DEP A Item B 1
11 DEP A Item B 1 11 B DEP A Item B 1
12 DEP A Item B 1 24 A B C DEP A Item B 1

Here is sample data and a query I've been trying, which sort of gives correct results, but not in the required way.

CREATE TABLE #temp (
    id INT,
    item NVARCHAR(50)
);


INSERT INTO #temp (id, item) VALUES
(1,'DEP Item 1'),
(2,'DEP Item 10'),
(3,'DEP Item 11'),
(4,'DEP Item 24'),
(5,'DEP G1 Item B 1'),
(6,'DEP G1 Item B 10'),
(7,'DEP G1 Item B 11'),
(8,'DEP G1 Item B 24'),
(9,'DEP A Item B 1 1 A'),
(10,'DEP A Item B 1 10'),
(11,'DEP A Item B 1 11 B'),
(12,'DEP A Item B 1 24 A B C')

select *,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 1 THEN item
    ELSE LEFT(item, CHARINDEX(' ', item,
        CHARINDEX(' ', item)+1))
    end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 2 THEN item
    ELSE LEFT(item, CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item)+1)+1))
    end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 4 THEN item
    ELSE LEFT(item, CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item)+1)+1)+1))
    end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 5 THEN item
    ELSE LEFT(item, CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item)+1)+1)+1)+1))
    end
from #temp
2

There are 2 best solutions below

0
ValNik On BEST ANSWER

First, we make up all possible combinations of word sets for each id(item) where order of words is remains - this is path's.
6, DEP G1 Item B 1 ->(DEP), (DEP,G1), (DEP,G1,Item), (DEP,G1,Item,B), (DEP,G1, Item,B,1)
9, DEP G1 Item B 11->(DEP), (DEP,G1), (DEP,G1,Item), (DEP,G1,Item,B), (DEP,G1, Item,B,11)

Then select rows with common path's - thru comparing path's for every item with all item's path's - select rows with inner self join on t1.path=t2.path and t1.id<>t2.id.

And last - take greatest path's (order by count of words in path desc) for items.

See example

with r as(
  select id,1 n,item
     ,case when charindex(' ',item)>0 then  
        trim(substring(item,charindex(' ',item)+1,100))
      else ''
      end rest
     ,cast(case when charindex(' ',item)>0 then  
        trim(substring(item,1,charindex(' ',item)-1))
      else trim(item)
      end  as varchar(100)) path
from #temp
  union all
  select id,n+1 n,item
     ,case when charindex(' ',rest)>0 then  
        trim(substring(rest,charindex(' ',rest)+1,100))
      else ''
      end rest
     ,cast(concat(path,','
       ,case when charindex(' ',rest)>0 then  
        trim(substring(rest,1,charindex(' ',rest)-1))
        else trim(rest)
        end) as varchar(100)) path
from r where len(rest)>0
)
select s.path,n,t.id,t.item
from #temp t 
left join (
   select  t1.id,t1.path,t1.n
     ,row_number()over(partition by t1.id order by t1.n desc) rn
   from r t1
   inner join r t2 on t1.path=t2.path and t1.id<>t2.id
  ) s on s.rn=1 and s.id=t.id
order by n,path,t.id

output is

path (set) n id item
DEP 1 1 DEP
DEP,Item 2 2 DEP Item 1
DEP,Item 2 3 DEP Item 10
DEP,Item 2 5 DEP Item 24
DEP,G1,Item,B 4 6 DEP G1 Item B 1
DEP,G1,Item,B 4 8 DEP G1 Item B 11
DEP,G1,Item,B 4 9 DEP G1 Item B 24
DEP,A,Item,B,1 5 11 DEP A Item B 1 10
DEP,A,Item,B,1 5 13 DEP A Item B 1 24 A B C
DEP,A,Item,B,1,1,A 7 10 DEP A Item B 1 1 A
DEP,A,Item,B,1,1,A 7 14 DEP A Item B 1 1 A B C
null null 15 Hello world

There item id=15 not have common words with other rows.

on test data

CREATE TABLE #temp (id INT, item NVARCHAR(50));
INSERT INTO #temp (id, item) VALUES
(1,'DEP'),
(2,'DEP     Item 1'),
(3,'DEP Item 10'),
--(4,'DEP Item 11'),
(5,'DEP Item 24'),
(6,'DEP G1 Item B 1'),
--(7,'DEP G1 Item B 10'),
(8,'DEP G1 Item B 11'),
(9,'DEP G1 Item B 24'),
(10,'DEP A Item B 1 1 A'),
(11,'DEP A Item B 1 10'),
--(12,'DEP A Item B 1 11 B'),
(13,'DEP A Item B 1 24 A B C'),
(14,'DEP A Item B 1 1 A B C'),
(15,'Hello world')

;

Fiddle

0
SQLpro On

With:

CREATE TABLE Here_is_what_I_am_trying_to_achieve 
(id INT, item VARCHAR(32))

INSERT INTO Here_is_what_I_am_trying_to_achieve VALUES
(1 ,'DEP Item 1'),
(8 ,'DEP G1 Item B 24'),
(12 ,'DEP A Item B 1 24 A B C');

Solution is:

WITH 
T AS
(
SELECT id, value, 
       ROW_NUMBER() OVER(PARTITION BY id ORDER BY (SELECT 1)) AS RN, 
       COUNT(*)  OVER(PARTITION BY id) AS N
FROM   Here_is_what_I_am_trying_to_achieve
       CROSS APPLY STRING_SPLIT(item, ' ')
)
SELECT id, STRING_AGG(value, ' ')
FROM   T
WHERE  RN <= CASE N WHEN 3 THEN 2
                    WHEN 4 THEN 2
                    WHEN 5 THEN 4
                    WHEN 6 THEN 4
                    ELSE 5
             END
GROUP  BY id