I am trying to select same substring of multiple rows without specifying the substring first.
Basically find the common part in a string. The common parts in a string range from 2 to 5 words.
Here is what I am trying to achieve:
The goal is to find the longest/last common denominator of multiple rows, creating a group / a set. The substring at the end of a string that is unique to a single row in a set should be trimmed.
Rows with that have same substring from left are considered part of a set.
"Item A 1" and "Item A 2" are part of "Item A" set.
"Item B 1" and "Item B 2" are part of "Item B" set.
The sets can include multiple parts from string.
"Item A A 1" and "Item A A 2" are part of "Item A A" set.
"Item A B 1" and "Item A B 2" are part of "Item A B" set.
Also value of the row might be just the name of a set.
"Item A" and "Item A 1" are part of "Item A" set.
"Item A B" and "Item A B 1" are part of "Item A B" set.
| id | item | set |
|---|---|---|
| 1 | DEP Item 1 | DEP Item |
| 2 | DEP Item 10 | DEP Item |
| 3 | DEP Item 11 | DEP Item |
| 4 | DEP Item 24 | DEP Item |
| 5 | DEP G1 Item B 1 | DEP G1 Item B |
| 6 | DEP G1 Item B 10 | DEP G1 Item B |
| 7 | DEP G1 Item B 11 | DEP G1 Item B |
| 8 | DEP G1 Item B 24 | DEP G1 Item B |
| 9 | DEP A Item B 1 1 A | DEP A Item B 1 |
| 10 | DEP A Item B 1 10 | DEP A Item B 1 |
| 11 | DEP A Item B 1 11 B | DEP A Item B 1 |
| 12 | DEP A Item B 1 24 A B C | DEP A Item B 1 |
Here is sample data and a query I've been trying, which sort of gives correct results, but not in the required way.
CREATE TABLE #temp (
id INT,
item NVARCHAR(50)
);
INSERT INTO #temp (id, item) VALUES
(1,'DEP Item 1'),
(2,'DEP Item 10'),
(3,'DEP Item 11'),
(4,'DEP Item 24'),
(5,'DEP G1 Item B 1'),
(6,'DEP G1 Item B 10'),
(7,'DEP G1 Item B 11'),
(8,'DEP G1 Item B 24'),
(9,'DEP A Item B 1 1 A'),
(10,'DEP A Item B 1 10'),
(11,'DEP A Item B 1 11 B'),
(12,'DEP A Item B 1 24 A B C')
select *,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 1 THEN item
ELSE LEFT(item, CHARINDEX(' ', item,
CHARINDEX(' ', item)+1))
end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 2 THEN item
ELSE LEFT(item, CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item)+1)+1))
end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 4 THEN item
ELSE LEFT(item, CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item)+1)+1)+1))
end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 5 THEN item
ELSE LEFT(item, CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item)+1)+1)+1)+1))
end
from #temp
First, we make up all possible combinations of word sets for each id(item) where order of words is remains - this is
path's.6, DEP G1 Item B 1->(DEP), (DEP,G1), (DEP,G1,Item), (DEP,G1,Item,B), (DEP,G1, Item,B,1)9, DEP G1 Item B 11->(DEP), (DEP,G1), (DEP,G1,Item), (DEP,G1,Item,B), (DEP,G1, Item,B,11)Then select rows with common path's - thru comparing path's for every item with all item's path's - select rows with inner self join on t1.path=t2.path and t1.id<>t2.id.
And last - take greatest path's (order by count of words in path desc) for items.
See example
output is
set)There item id=15 not have common words with other rows.
on test data
Fiddle