SQL Substring from a column of strings

90 Views Asked by At

I have a table with around 100,000 articles.

Each article has a description.

short example:

ARTIKELNR DESCRIPTION
104009400031900 S235JRG2C+C EN 10277/10 80 x 8 mm Blanker Flachstahl
104009800002950 Werksattest nach EN 10204
105009400092360 EN 10060 Inox Rund Ø 28 mm WNr1.4034 warmgefertigt
105009400068571 90 mm WNr1.4404 warmgefertigt, EN 10060, geschmiedet
105009400004420 WNr1.4301 Inox Vierkantstahl 6 x 6 mm EN 10278, blankgezogen
105009400008800 WNr1.4301 Inox Band geschnitten 25 x 4 mm, Breitentol. 25 m
105009400068600 WNr1.4112 Inox Rund 8.00 mm nach EN 10088-3 + A+C+SL, gesch
105009400068700 WNr1.4301 Inox Rund 36.00 mm nach EN 10088-3, geschliffen, T
105009400068800 WNr1.4104 Inox Rund 25.00 mm nach EN 10088-3, geschliffen, T
105009400010035 WNr1.4404 Inox Vierkantstahl 30 x 30 mm, gezogen EN 10278

None have the same format.

What I would like to do is extract the "NORM" from the description.

So, I would like to have the following result

ARTIKELNR NORM
104009400031900 EN 10277/10
104009800002950 EN 10204
105009400092360 EN 10060
105009400068571 EN 10060
105009400004420 EN 10278
105009400008800
105009400068600 EN 10088-3
105009400068700 EN 10088-3
105009400068800 EN 10088-3
105009400010035 EN 10278
3

There are 3 best solutions below

3
Yoji On

This can be a hint:

SELECT 
    ARTIKELNR,
    CASE 
        WHEN CHARINDEX('EN ', DESCRIPTION) > 0 THEN
            SUBSTRING(DESCRIPTION, CHARINDEX('EN ', DESCRIPTION), PATINDEX('%[0-9/.]%', SUBSTRING(DESCRIPTION, CHARINDEX('EN ', DESCRIPTION), LEN(DESCRIPTION))))
        ELSE
            ''
    END AS NORM
FROM 
    YourTableName

This query assumes that "EN " followed by digits, dots, or slashes uniquely identifies the start of the NORM.

7
Thom A On

In the most recent versions of SQL Server, this is actually somewhat simple with builtin functionality. YOu can use STRING_SPLIT to split the value, and ensure you return the ordinal positions so that you can use LEAD or LAG get the next/last delimited value and filter to your EN values. Finally you can use TRIM to remove those trailing commas:

WITH CTE AS(
    SELECT V.ARTIKELNR,
           V.DESCRIPTION,
           SS.value,
           LEAD(SS.value) OVER (PARTITION BY V.ARTIKELNR ORDER BY SS.ordinal) AS NextValue
    FROM (VALUES(104009400031900,N'S235JRG2C+C EN 10277/10 80 x 8 mm Blanker Flachstahl'),
                (104009800002950,N'Werksattest nach EN 10204'),
                (105009400092360,N'EN 10060 Inox Rund Ø 28 mm WNr1.4034 warmgefertigt'),
                (105009400068571,N'90 mm WNr1.4404 warmgefertigt, EN 10060, geschmiedet'),
                (105009400004420,N'WNr1.4301 Inox Vierkantstahl 6 x 6 mm EN 10278, blankgezogen'),
                (105009400008800,N'WNr1.4301 Inox Band geschnitten 25 x 4 mm, Breitentol. 25 m'),
                (105009400068600,N'WNr1.4112 Inox Rund 8.00 mm nach EN 10088-3 + A+C+SL, gesch'),
                (105009400068700,N'WNr1.4301 Inox Rund 36.00 mm nach EN 10088-3, geschliffen, T'),
                (105009400068800,N'WNr1.4104 Inox Rund 25.00 mm nach EN 10088-3, geschliffen, T'),
                (105009400010035,N'WNr1.4404 Inox Vierkantstahl 30 x 30 mm, gezogen EN 10278'))V(ARTIKELNR,DESCRIPTION)
         CROSS APPLY STRING_SPLIT(V.DESCRIPTION,' ',1) SS)
SELECT C.ARTIKELNR,
       TRIM(',' FROM CONCAT_WS(' ',C.value,C.NextValue)) AS Norm
FROM CTE C
WHERE C.value = 'EN';

db<>fiddle

In 2012 (reminder, this became completely unsupported in 2022) it's a little harder. You can replace STRING_SPLIT with a different non-builtin splitter that returns the ordinal position, such as DelimitedN4KSplit_LEAD (N4K as your strings appear to be an nvarchar). You still have access to LEAD, but not TRIM, so you'll need to use REPLACE and RTRIM. This gives you something like this:

WITH CTE AS(
    SELECT V.ARTIKELNR,
           V.DESCRIPTION,
           DS.item,
           LEAD(DS.Item) OVER (PARTITION BY V.ARTIKELNR ORDER BY DS.ItemNumber) AS nextitem
    FROM (VALUES(104009400031900,N'S235JRG2C+C EN 10277/10 80 x 8 mm Blanker Flachstahl'),
                (104009800002950,N'Werksattest nach EN 10204'),
                (105009400092360,N'EN 10060 Inox Rund Ø 28 mm WNr1.4034 warmgefertigt'),
                (105009400068571,N'90 mm WNr1.4404 warmgefertigt, EN 10060, geschmiedet'),
                (105009400004420,N'WNr1.4301 Inox Vierkantstahl 6 x 6 mm EN 10278, blankgezogen'),
                (105009400008800,N'WNr1.4301 Inox Band geschnitten 25 x 4 mm, Breitentol. 25 m'),
                (105009400068600,N'WNr1.4112 Inox Rund 8.00 mm nach EN 10088-3 + A+C+SL, gesch'),
                (105009400068700,N'WNr1.4301 Inox Rund 36.00 mm nach EN 10088-3, geschliffen, T'),
                (105009400068800,N'WNr1.4104 Inox Rund 25.00 mm nach EN 10088-3, geschliffen, T'),
                (105009400010035,N'WNr1.4404 Inox Vierkantstahl 30 x 30 mm, gezogen EN 10278'))V(ARTIKELNR,DESCRIPTION)
         CROSS APPLY fn.DelimitedSplitN4K_LEAD(V.DESCRIPTION,' ') DS)
SELECT C.ARTIKELNR,
       RTRIM(REPLACE(CONCAT(C.item,' ',C.nextitem),',',' ')) AS Norm
FROM CTE C
WHERE C.item = 'EN';
0
Yitzhak Khabinsky On

Please try the following solution leveraging SQL Server XML and XQuery functionality.

It will work starting from SQL Server 2012 onwards.

Notable points:

  • CROSS APPLY is tokenizing DESCRIPTION column as XML.
  • XQuery .query() method is looking for the "EN" token via FLWOR expression and the following token via XPath predicate ...r[position()=($pos, $pos + 1)].
  • XQuery .value() method gives us a desired output string.
  • REPLACE() function removes optional trailing comma.

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, ARTIKELNR CHAR(15), DESCRIPTION NVARCHAR(1024));
INSERT INTO @tbl (ARTIKELNR, DESCRIPTION) VALUES
(104009400031900,N'S235JRG2C+C EN 10277/10 80 x 8 mm Blanker Flachstahl'),
(104009800002950,N'Werksattest nach EN 10204'),
(105009400092360,N'EN 10060 Inox Rund Ø 28 mm WNr1.4034 warmgefertigt'),
(105009400068571,N'90 mm WNr1.4404 warmgefertigt, EN 10060, geschmiedet'),
(105009400004420,N'WNr1.4301 Inox Vierkantstahl 6 x 6 mm EN 10278, blankgezogen'),
(105009400008800,N'WNr1.4301 Inox Band geschnitten 25 x 4 mm, Breitentol. 25 m'),
(105009400068600,N'WNr1.4112 Inox Rund 8.00 mm nach EN 10088-3 + A+C+SL, gesch'),
(105009400068700,N'WNr1.4301 Inox Rund 36.00 mm nach EN 10088-3, geschliffen, T'),
(105009400068800,N'WNr1.4104 Inox Rund 25.00 mm nach EN 10088-3, geschliffen, T'),
(105009400010035,N'WNr1.4404 Inox Vierkantstahl 30 x 30 mm, gezogen EN 10278');
-- DDL and sample data population, end

DECLARE @separator CHAR(1) = SPACE(1);

SELECT *
    , REPLACE(c.query('
          for $x in /root/r[text()="EN"]
          let $pos := count(root/r[. << $x]) + 1
          return data(/root/r[position()=($pos, $pos + 1)])
        ').value('text()[1]', 'NVARCHAR(30)')
        ,',', '') AS NORM
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' + 
    REPLACE(DESCRIPTION, @separator, ']]></r><r><![CDATA[') + 
    ']]></r></root>' AS XML)) AS t1(c)
ORDER BY ID;

Output

ID ARTIKELNR DESCRIPTION NORM
1 104009400031900 S235JRG2C+C EN 10277/10 80 x 8 mm Blanker Flachstahl EN 10277/10
2 104009800002950 Werksattest nach EN 10204 EN 10204
3 105009400092360 EN 10060 Inox Rund Ø 28 mm WNr1.4034 warmgefertigt EN 10060
4 105009400068571 90 mm WNr1.4404 warmgefertigt, EN 10060, geschmiedet EN 10060
5 105009400004420 WNr1.4301 Inox Vierkantstahl 6 x 6 mm EN 10278, blankgezogen EN 10278
6 105009400008800 WNr1.4301 Inox Band geschnitten 25 x 4 mm, Breitentol. 25 m NULL
7 105009400068600 WNr1.4112 Inox Rund 8.00 mm nach EN 10088-3 + A+C+SL, gesch EN 10088-3
8 105009400068700 WNr1.4301 Inox Rund 36.00 mm nach EN 10088-3, geschliffen, T EN 10088-3
9 105009400068800 WNr1.4104 Inox Rund 25.00 mm nach EN 10088-3, geschliffen, T EN 10088-3
10 105009400010035 WNr1.4404 Inox Vierkantstahl 30 x 30 mm, gezogen EN 10278 EN 10278